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Abstract 

In this paper we introduce a novel framework for making exact nonparametric Bayesian inference 
on latent functions that is particularly suitable for Big Data tasks. Firstly, we introduce a class 
of stochastic processes we refer to as string Gaussian processes (string GPs which are not to be 
mistaken for Gaussian processes operating on text). We construct string GPs so that their finite¬ 
dimensional marginals exhibit suitable local conditional independence structures, which allow for 
scalable, distributed, and flexible nonparametric Bayesian inference, without resorting to approxi¬ 
mations, and while ensuring some mild global regularity constraints. Furthermore, string GP priors 
naturally cope with heterogeneous input data, and the gradient of the learned latent function is read¬ 
ily available for explanatory analysis. Secondly, we provide some theoretical results relating our 
approach to the standard GP paradigm. In particular, we prove that some string GPs are Gaussian 
processes, which provides a complementary global perspective on our framework. Finally, we de¬ 
rive a scalable and distributed MCMC scheme for supervised learning tasks under string GP priors. 
The proposed MCMC scheme has computational time complexity O(N) and memory requirement 
O(dN), where N is the data size and d the dimension of the input space. We illustrate the efficacy 
of the proposed approach on several synthetic and real-world data sets, including a data set with 6 
millions input points and 8 attributes. 

Keywords: String Gaussian processes, scalable Bayesian nonparametrics, Gaussian processes, 
nonstationary kernels, reversible-jump MCMC, point process priors 


1. Introduction 

Many problems in statistics and machine learning involve inferring a latent function from train¬ 
ing data (for instance regression, classification, inverse reinforcement learning, inference on point 
processes to name but a few). Real-valued stochastic processes, among which Gaussian processes 
(GPs), are often used as functional priors for such problems, thereby allowing for a full Bayesian 
nonparametric treatment. In the machine learning community, interest in GPs grew out of the ob¬ 
servation that some Bayesian neural networks converge to GPs as the number of hidden units ap¬ 
proaches infinity (Neal (1996)). Since then, other similarities have been established between GPs 
and popular models such as Bayesian linear regression, Bayesian basis function regression, spline 
models and support vector machines (Rasmussen and Williams (2006)). Flowever, they often per¬ 
form poorly on Big Data tasks primarily for two reasons. Firstly, large data sets are likely to exhibit 
multiple types of local patterns that should appropriately be accounted for by flexible and possibly 
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nonstationary covariance functions, the development of which is still an active subject of research. 
Secondly, inference under GP priors often consists of looking at the values of the GP at all input 
points as a jointly Gaussian vector with fully dependent coordinates, which induces a memory re¬ 
quirement and time complexity respectively squared and cubic in the training data size, and thus 
is intractable for large data sets. We refer to this approach as the standard GP paradigm. The 
framework we introduce in this paper addresses both of the above limitations. 

Our work is rooted in the observation that, from a Bayesian nonparametric perspective, it is 
inefficient to define a stochastic process through fully-dependent marginals, as it is the case for 
Gaussian processes. Indeed, if a stochastic process (/(x)) xg]R d has fully dependent marginals and 
exhibits no additional conditional independence structure then, when / is used as functional prior 
and some observations related to ij{x \)...., f(x n )) are gathered, namely (y\, . . . , y n ), the addi¬ 
tional memory required to take into account an additional piece of information (y n +i,x n +i) grows 
in (D(n), as one has to keep track of the extent to which y n +i informs us about /(.x,) for every 
i < n, typically through a covariance matrix whose size will increase by 2n + 1 terms. Clearly, 
this is inefficient, as y n +1 is unlikely to be informative about f(x t ). unless :x t is sufficiently close 
to x n+ i. More generally, the larger n, the less information a single additional pair (y n +i, x n +i) 
will add to existing data, and yet the increase in memory requirement will be much higher than that 
required while processing earlier and more informative data. This inefficiency in resource require¬ 
ments extends to computational time, as the increase in computational time resulting from adding 
(y n +i, x n+ i) typically grows in O(rr ), which is the difference between the numbers of operations 
required to invert an x n matrix and to invert a (n + 1) x (n + 1) matrix. A solution for addressing 
this inefficiency is to appropriately limit the extent to which values f(x i),..., f(x n ) are related 
to each other. Existing approaches such as sparse Gaussian processes (see Quinonero-Candela and 
Rasmussen (2005) for a review), resort to an ex-post approximation of fully-dependent Gaussian 
marginals with multivariate Gaussians exhibiting conditional independence structures. Unfortu¬ 
nately, these approximations trade-off accuracy for scalability through a control variable, namely 
the number of inducing points, whose choice is often left to the user. The approach we adopt in 
this paper consists of going back to stochastic analysis basics, and constructing stochastic processes 
whose finite-dimensional marginals exhibit suitable conditional independence structures so that we 
need not resorting to ex-post approximations. Incidentally, unlike sparse GP techniques, the condi¬ 
tional independence structures we introduce also allow for flexible and principled learning of local 
patterns, and this increased flexibility does not come at the expense of scalability. 

The contributions of this paper are as follows. We introduce a novel class of stochastic pro¬ 
cesses, string Gaussian processes ( string GPs ), that may be used as priors over latent functions 
within a Bayesian nonparametric framework, especially for large scale problems and in the presence 
of possibly multiple types of local patterns. We propose a framework for analysing the flexibility 
of random functions and surfaces, and prove that our approach yields more flexible stochastic pro¬ 
cesses than isotropic Gaussian processes. We demonstrate that exact inference under a string GP 
prior scales considerably better than in the standard GP paradigm, and is amenable to distributed 
computing. We illustrate that popular stationary kernels can be well approximated within our frame¬ 
work, making string GPs a scalable alternative to commonly used GP models. We derive the joint 
law of a string GP and its gradient, thereby allowing for explanatory analysis on the learned latent 
function. We propose a reversible-jump Markov Chain Monte Carlo sampler for automatic learning 
of model complexity and local patterns from data. 
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The rest of the paper is structured as follows. In Section 2 we review recent advances on Gaus¬ 
sian processes in relation to inference on large data sets. In Section 3 we formally construct string 
GPs and derive some important results. In Section 4 we provide detailed illustrative and theoretical 
comparisons between string GPs and the standard GP paradigm. In Section 5 we propose methods 
for inferring latent functions under string GP priors with time complexity and memory requirement 
that are linear in the size of the data set. The efficacy of our approach compared to competing 
alternatives is illustrated in Section 6. Finally, we finish with a discussion in Section 7. 

2. Related Work 

The two primary drawbacks of the standard GP paradigm on large scale problems are the lack 
of scalability resulting from postulating a full multivariate Gaussian prior on function values at 
all training inputs, and the difficulty postulating a priori a class of covariance functions capable 
of capturing intricate and often local patterns likely to occur in large data sets. A tremendous 
amount of work has been published that attempt to address either of the aforementioned limitations. 
However, scalability is often achieved either through approximations or for specific applications, 
and nonstationarity is usually introduced at the expense of scalability, again for specific applications. 

2.1 Scalability Through Structured Approximations 

As far as scalability is concerned, sparse GP methods have been developed that approximate the 
multivariate Gaussian probability density function (pdf) over training data with the marginal over 
a smaller set of inducing points multiplied by an approximate conditional pdf (Smola and Bartlett 
(2001); Lawrence et al. (2003); Seeger (2003b, a); Snelson and Ghahramani (2006)). This approx¬ 
imation yields a time complexity linear—rather than cubic—in the data size and squared in the 
number of inducing points. We refer to Quinonero-Candela and Rasmussen (2005) for a review 
of sparse GP approximations. More recently, Hensman et al. (2013, 2015) combined sparse GP 
methods with Stochastic Variational Inference (Hoffman et al. (2013)) for GP regression and GP 
classification. However, none of these sparse GP methods addresses the selection of the number 
of inducing points (and the size of the minibatch in the case of Hensman et al. (2013, 2015)), al¬ 
though this may greatly affect scalability. More importantly, although these methods do not impose 
strong restrictions on the covariance function of the GP model to approximate, they do not address 
the need for flexible covariance functions inherent to large scale problems, which are more likely 
to exhibit intricate and local patterns, and applications considered by the authors typically use the 
vanilla squared exponential kernel. 

Lazaro-Gredilla et al. (2010) proposed approximating stationary kernels with truncated Fourier 
series in Gaussian process regression. An interpretation of the resulting sparse spectrum Gaussian 
process model as Bayesian basis function regression with a finite number K of trigonometric basis 
functions allows making inference in time complexity and memory requirement that are both lin¬ 
eal - in the size of the training sample. However, this model has two major drawbacks. Firstly, it 
is prone to over-fitting. In effect, the learning machine will aim at inferring the K major spectral 
frequencies evidenced in the training data. This will only lead to appropriate prediction out-of- 
sample when the underlying latent phenomenon can be appropriately characterised by a finite dis¬ 
crete spectral decomposition that is expected to be the same everywhere on the domain. Secondly, 
this model implicitly postulates that the covariance between the values of the GP at two points does 
not vanish as the distance between the points becomes arbitrarily large. This imposes a priori the 
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view that the underlying function is highly structured, which might be unrealistic in many real-life 
non-periodic applications. This approach is generalised by the so-called random Fourier features 
methods (Rahimi and Recht (2007); Le et al. (2013); Yang et al. (2015)). Unfortunately all existing 
random Fourier features methods give rise to stationary covariance functions, which might not be 
appropriate for data sets exhibiting local patterns. 

The bottleneck of inference in the standard GP paradigm remains inverting and computing 
the determinant of a covariance matrix, normally achieved through the Cholesky decomposition or 
Singular Value Decomposition. Methods have been developed that speed-up these decompositions 
through low rank approximations (Williams and Seeger (2001)) or by exploiting specific structures 
in the covariance function and in the input data (Saatchi (2011); Wilson et al. (2014)), which typi¬ 
cally give rise to Kronecker or Toeplitz covariance matrices. While the Kronecker method used by 
Saatchi (2011) and Wilson et al. (2014) is restricted to inputs that form a Cartesian grid and to sepa¬ 
rable kernels, 1 low rank approximations such as the Nystrom method used by Williams and Seeger 
(2001) modify the covariance function and hence the functional prior in a non-trivial way. Methods 
have also been proposed to interpolate the covariance matrix on a uniform or Cartesian grid in order 
to benefit from some of the computational gains of Toeplitz and Kronecker techniques even when 
the input space is not structured (Wilson and Nickisch (2015)). Flowever, none of these solutions is 
general as they require that either the covariance function be separable (Kronecker techniques), or 
the covariance function be stationary and the input space be one-dimensional (Toeplitz techniques). 

2.2 Scalability Through Data Distribution 

A family of methods have been proposed to scale-up inference in GP models that are based on the 
observation that it is more computationally efficient to compute the pdf of K independent small 
Gaussian vectors with size n than to compute the pdf of a single bigger Gaussian vector of size nK. 
For instance, Kim et al. (2005) and Gramacy and Lee (2008) partitioned the input space, and put 
independent stationary GP priors on the restrictions of the latent function to the subdomains forming 
the partition, which can be regarded as independent local GP experts. Kim et al. (2005) partitioned 
the domain using Voronoi tessellations, while Gramacy and Lee (2008) used tree based partitioning. 
These two approaches are provably equivalent to postulating a (nonstationary) GP prior on the whole 
domain that is discontinuous along the boundaries of the partition, which might not be desirable if 
the latent function we would like to infer is continuous, and might affect predictive accuracy. The 
more local experts there are, the more scalable the model will be, but the more discontinuities the 
latent function will have, and subsequently the less accurate the approach will be. 

Mixtures of Gaussian process experts models (MoE) (Tresp (2001); Rasmussen and Ghahramani 
(2001); Meeds and Osindero (2006); Ross and Dy (2013)) provide another implementation of this 
idea. MoE models assume that there are multiple latent functions to be inferred from the data, 
on which it is placed independent GP priors, and each training input is associated to one latent 
function. The number of latent functions and the repartition of data between latent functions can 
then be performed in a full Bayesian nonparametric fashion (Rasmussen and Ghahramani (2001); 
Ross and Dy (2013)). When there is a single continuous latent function to be inferred, as it is the 
case for most regression models, the foregoing Bayesian nonparametric approach will learn a single 
latent function, thereby leading to a time complexity and a memory requirement that are the same 
as in the standard GP paradigm , which defies the scalability argument. 

1. That is multivariate kernel that can be written as product of univariate kernels. 
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The last implementation of the idea in this section consists of distributing the training data 
over multiple independent but identical GP models. In regression problems, examples include the 
Bayesian Committee Machines (BCM) of Tresp (2000), the generalized product of experts (gPoE) 
model of Cao and Fleet (2014), and the robust Bayesian Committee Machines (rBCM) of Deisen- 
roth and Ng (2015). These models propose splitting the training data in small subsets, each subset 
being assigned to a different GP regression model—referred to as an expert—that has the same 
hyper-parameters as the other experts, although experts are assumed to be mutually independent. 
Training is performed by maximum marginal likelihood, with time complexity (resp. memory re¬ 
quirement) linear in the number of experts and cubic (resp. squared) in the size of the largest 
data set processed by an expert. Predictions are then obtained by aggregating the predictions of 
all GP experts in a manner that is specific to the method used (that is the BCM, the gPoE or 
the rBCM). However, these methods present major drawbacks in the training and testing proce¬ 
dures. In effect, the assumption that experts have identical hyper-parameters is inappropriate for 
data sets exhibiting local patterns. Even if one would allow GP experts to be driven by different 
hyper-parameters as in Nguyen and Bonilla (2014) for instance, learned hyper-parameters would 
lead to overly simplistic GP experts and poor aggregated predictions when the number of training 
inputs assigned to each expert is small—this is a direct consequence of the (desirable) fact that 
maximum marginal likelihood GP regression abides by Occam's razor. Another critical pitfall of 
BCM, gPoE and rBCM is that their - methods for aggregating expert predictions are Kolmogorov 
inconsistent. For instance, denoting p the predictive distribution in the BCM, it can be easily 
seen from Equations (2.4) and (2.5) in Tresp (2000) that the predictive distribution p(f(x\)\'D) 
(resp. p(f(x 2 )|^)) 2 provided by the aggregation procedure of the BCM is not the marginal over 
f( x 2 ) (resp. over /(#*)) of the multivariate predictive distribution p(f(x\),f(x 2 )\D) obtained 
from experts multivariate predictions Pk(f(x\), /(x|)|P) using the same aggregation procedure: 
p(f( x i)I'D) 7^ f P(f( x i), f{x* 2 )\V)df (x* 2 ). Without Kolmogorov consistency, it is impossible to 
make principled Bayesian inference of latent function values. A principled Bayesian nonparametric 
model should not provide predictions about f(xf) that differ depending on whether or not one is 
also interested in predicting other values f(x*) simultaneously. This pitfall might be the reason 
why Cao and Fleet (2014) and Deisenroth and Ng (2015) restricted their expositions to predictive 
distributions about a single function value at a time p(f(x*)\V), although their procedures (Equa¬ 
tion 4 in Cao and Fleet (2014) and Equation 20 in Deisenroth and Ng (2015)) are easily extended to 
posterior distributions over multiple function values. These extensions would also be Kolmogorov 
inconsistent, and restricting the predictions to be of exactly one function value is unsatisfactory as 
it does not allow determining the posterior covariance between function values at two test inputs. 

2.3 Expressive Stationary Kernels 

In regards to flexibly handling complex patterns likely to occur in large data sets, Wilson and Adams 
(2013) introduced a class of expressive stationary kernels obtained by summing up convolutions of 
Gaussian basis functions with Dirac delta functions in the spectral domain. The sparse spectrum 
kernel can be thought of as the special case where the convolving Gaussian is degenerate. Although 
such kernels perform particularly well in the presence of globally repeated patterns in the data, their 
stationarity limits their utility on data sets with local patterns. Moreover the proposed covariance 


2. Here / is the latent function to be inferred, x*,X 2 are test points and V denotes training data. 
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functions generate infinitely differentiable random functions, which might be too restrictive in some 
applications. 

2.4 Application-Specific Nonstationary Kernels 

As for nonstationary kernels, Paciorek and Schervish (2004) proposed a method for constructing 
nonstationary covariance functions from any stationary one that involves introducing n input de¬ 
pendent d X d covariance matrices that will be inferred from the data. Plagemann et al. (2008) 
proposed a faster approximation to the model of Paciorek and Schervish (2004). However, both 
approaches scale poorly with the input dimension and the data size as they have time complexity 
O (max(nd 3 ,n 3 )). MacKay (1998), Schmidt and O’Hagan (2003), and Calandra et al. (2014) 
proposed kernels that can be regarded as stationary after a non-linear transformation d on the 
input space: k(x,x') = h (||d(x) — d(x')||), where h is positive semi-definite. Although for a 
given deterministic function d the kernel k is nonstationary, Schmidt and O’Hagan (2003) put a 
GP prior on d with mean function rn(x) = x and covariance function invariant under translation, 
which unfortunately leads to a kernel that is (unconditionally) stationary, albeit more flexible than 
h (||a; — x'\\) . To model nonstationarity, Adams and Stegle (2008) introduced a functional prior of 
the form y(x) = f(x) exp g(x) where / is a stationary GP and g is some scaling function on the 
domain. For a given non-constant function g such a prior indeed yields a nonstationary Gaussian 
process. However, when a stationary GP prior is placed on the function g as Adams and Stegle 
(2008) did, the resulting functional prior y(x) = f(x) exp g(x) becomes stationary. The piece- 
wise GP (Kim et al. (2005)) and treed GP (Gramacy and Lee (2008)) models previously discussed 
also introduce nonstationarity. The authors’ premise is that heterogeneous patterns might be lo¬ 
cally homogeneous. However, as previously discussed such models are inappropriate for modelling 
continuous latent functions. 

2.5 Our Approach 

The approach we propose in this paper for inferring latent functions in large scale problems, possi¬ 
bly exhibiting locally homogeneous patterns, consists of constructing a novel class of smooth , non¬ 
stationary and flexible stochastic processes we refer to as string Gaussian processes ( string GPs ), 
whose finite dimensional marginals are structured enough so that full Bayesian nonparametric in¬ 
ference scales linearly with the sample size, without resorting to approximations. Our approach is 
analogous to MoE models in that, when the input space is one-dimensional, a string GP can be re¬ 
garded as a collaboration of local GP experts on non-overlapping supports, that implicitly exchange 
messages with one another, and that are independent conditional on the aforementioned messages. 
Each local GP expert only shares just enough information with adjacent local GP experts for the 
whole stochastic process to be sufficiently smooth (for instance continuously differentiable), which 
is an important improvement over MoE models as the latter generate discontinuous latent functions. 
These messages will take the form of boundary conditions, conditional on which each local GP 
expert will be independent from any other local GP expert. Crucially, unlike the BCM, the gPoE 
and the rBCM, we do not assume that local GP experts share the same prior structure (that is mean 
function, covariance function, or hyper-parameters). This allows each local GP expert to flexibly 
learn local patterns from the data if there are any, while preserving global smoothness, which will 
result in improved accuracy. Similarly to MoEs, the computational gain in our approach stems from 
the fact that the conditional independence of the local GP experts conditional on shared boundary 
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conditions will enable us to write the joint distribution over function and derivative values at a large 
number of inputs as the product of pdfs of much smaller Gaussian vectors. The resulting effect on 

time complexity is a decrease from 0(N 3 ) to 0(max nf), where N = Yk n ki 71 k ^ N. In fact, 

k 

in Section 5 we will propose Reversible-Jump Monte Carlo Markov Chain (RJ-MCMC) inference 
methods that achieve memory requirement and time complexity O(N), without any loss of flexi¬ 
bility. All these results are preserved by our extension of string GPs to multivariate input spaces, 
which we will occasionally refer to as membrane Gaussian processes (or membrane GPs). Unlike 
the BCM, the gPoE and the rBCM, the approach we propose in this paper, which we will refer to as 
the string GP paradigm, is Kolmogorov consistent, and enables principled inference of the posterior 
distribution over the values of the latent function at multiple test inputs. 

3. Construction of String and Membrane Gaussian Processes 

In this section we formally construct string Gaussian processes, and we provide some important 
theoretical results including smoothness, and the joint law of string GPs and their gradients. We 
construct string GPs indexed on M, before generalising to string GPs indexed on M f/ , which we will 
occasionally refer to as membrane GPs to stress that the input space is multivariate. We start by 
considering the joint law of a differentiable GP on an interval and its derivative, and introducing 
some related notions that we will use in the construction of string GPs. 

Proposition 1 (Derivative Gaussian processes) 

Let I be an interval, k : I x / Ala C~ symmetric positive semi-definite function? 

C 1 function. 

(A) There exists nM 2 -valued stochastic process {D t ) t€l , D t = (zt, z' t ), such that for all t\,... ,t n G 

I, (zt t ,..., Zt n , z' t ,..., Zf ) is a Gaussian vector with mean ..., m(t n ), yjrifi), ..., yff{t n )) 

and covariance matrix such that 

dk d 2 k 

cov(z ti ,z tj ) = k(U,tj), cov(z ti ,z ' t .) = and cov{z' t .,z' t .) = 

where (resp. refers to the partial derivative with respect to the first (resp. second) variable 
of k. We herein refer to (. Dt)tei as a derivative Gaussian process. 

(B) ( zt)tei is a Gaussian process with mean function m, covariance function k and that is C 1 in the 
L 2 (mean square) sense. 

(C) (z' t )t£i is a Gaussian process with mean function and covariance function q~q . Moreover, 

( z' t )t£i is the L 2 derivative of the process ( Zt)tel■ 

Proof Although this result is known in the Gaussian process community, we provide a proof for 
the curious reader in Appendix B. ■ 

We will say of a kernel k that it is degenerate at a when a derivative Gaussian process ( zt , 

3. C 1 (resp. C 2 ) functions denote functions that are once (resp. twice) continuously differentiable on their domains. 
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with kernel k is such that z a and z' a are perfectly correlated, 4 5 that is 

|corr(z a , z' a )\ = 1. 

As an example, the linear kernel k(u, v ) = cr 2 {u — c)(y — c) is degenerate at 0. Moreover, we 
will say of a kernel k that it is degenerate at b given a when it is not degenerate at a and when 
the derivative Gaussian process (zt,z' t )t£i with kernel k is such that the variances of z b and z' b 
conditional on ( z a , z’ a ) are both zero/ For instance, the periodic kernel proposed by MacKay (1998) 
with period T is degenerate at u + T given u. 

An important subclass of derivative Gaussian processes in our construction are the processes 
resulting from conditioning paths of a derivative Gaussian process to take specific values at certain 
times (/...., t c ). We herein refer to those processes as conditional derivative Gaussian process. 
As an illustration, when k is C 3 on I x I with / = [a, b], and neither degenerate at a nor degenerate 
at b given a, the conditional derivative Gaussian process on I = [a, b] with unconditional mean 
function m and unconditional covariance function k that is conditioned to start at (z a , z' a ) is the 
derivative Gaussian process with mean function 


V t <E /, m a c (t\ z a , z' a ) = m(t) + K (;a K a . 4 
and covariance function kk that reads 


z a - m(a) 


where K u; „ = 


v t, s G I, K(t, s ) = k{t, s) - K t;Q K-iK^ 

dk i 


, and K t;a = 


k(u,v) v) 

process is conditioned to start at (z a , z' a ) and to end at z' h ). the mean function reads 


Ht,a) || (t, a) 


, CD 

( 2 ) 

. Similarly, when the 


m a c ' b {t-za,z' a ,z b ,z' b ) 




' z a - m(a)' 

*«-$(“) 

z b - m{b) 


(3) 


and the covariance function kr ,b reads 


v t, s e I, s) = k(t, s ) - K £;(a>b) K (a ; 6);(aifc) K; ;(a>6) , 


(4) 


where K (aift);(0i6) 


K a;a 

K b -a 


K a -b 

K kb 


and 


K f 


(a,b) 


[K £;o K/./J . It is important to note that 


both K a;a and K( a ib )-( a ,b) are indeed invertible because the kernel is assumed to be neither degenerate 
at a nor degenerate at b given a. Flence, the support of (z a , z' a , z b , z' h ) is M 4 , and any function and 
derivative values can be used for conditioning. Figure 1 illustrates example independent draws from 
a conditional derivative Gaussian process. 


4. Or equivalently when the Gaussian vector (z a , z' a ) is degenerate. 

5. Or equivalently when the Gaussian vector (z 0 , z' a ) is not degenerate but {z a , z' a , z b , z' b ) is. 
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3.1 String Gaussian Processes on M 

The intuition behind string Gaussian processes on an interval comes from the analogy of collab¬ 
orative local GP experts we refer to as strings that arc connected but independent of each other 
conditional on some regularity boundary conditions. While each string is tasked with representing 
local patterns in the data, a string only shares the states of its extremities (value and derivative) 
with adjacent strings. Our aim is to preserve global smoothness and limit the amount of informa¬ 
tion shared between strings, thus reducing computational complexity. Furthermore, the conditional 
independence between strings will allow for distributed inference, greater flexibility and principled 
nonstationarity construction. 

The following theorem at the core of our framework establishes that it is possible to connect 
together GPs on a partition of an interval I, in a manner consistent enough that the newly constructed 
stochastic object will be a stochastic process on I and in a manner restrictive enough that any two 
connected GPs will share just enough information to ensure that the constructed stochastic process 
is continuously differentiable (C 1 ) on I in the L 2 sense. 

Theorem 2 (String Gaussian process) 

Let ao <■■■< a/- <■■■ < ok, I = [ao> a/v] and letpy/{x', p, E) be the multivariate Gaussian den¬ 
sity with mean vector p and covariance matrix E. Furthermore, let (rrik : [ah- 1 , o^] -> K) fcg [ 1 K \ 
be C 1 functions, and (kk : [afc-i,a/c] x [«fc-i)Ofc] ► lR)fce[i...K'] be C 3 symmetric positive semi- 
definite functions, neither degenerate at ak-i, nor degenerate at ak given cik- 1 - 

(A) There exists an M 2 -valued stochastic process ( SD t )t&p SD t = ( zt , z'f) satisfying the following 
conditions: 

1) The probability density of (SD ao ,..., SD aii ) reads: 

I< 

p b {x o, • • •, X K ) := PM ( 'x k \p b k , E *k) (5) 

fc=o 


where: 


yb _ 
2^0 — 


lK, 


a o;“o> 


V k > 0 


n = 


kKa k ;a k A- A a fc ;a fc _i 


— l-K,, 


kK~ 


vKi 


( 6 ) 


Pq — 1 Ma 0 , V k > 0 Pk — kMa k + kKa k ;a k - 1 k^a k _ 1 ;a k - 1 ( x k-l k^a k _ 1 ), 


(V) 


With IcKirv = 


h(u,v) f(u,v) 

°-t («.«) Ss (“.*>) 


and kM„ = 


mfyu) 

L^W 


2) Conditional on ( SD ak = Xk)ke[o..K]> the restrictions (SD t ) te i ak k G [l..iT] are indepen¬ 
dent conditional derivative Gaussian processes, respectively with unconditional mean function mk 
and unconditional covariance function kk and that are conditioned to take values Xk-i and Xk at 
(Ik -1 and ak respectively. We refer to ( SDt)tei as a string derivative Gaussian process, and to its 
first coordinate ( zf)tei as a string Gaussian process namely, 


(■ zt)tei ~ SGV({a k }, {m k }, {k k })- 

(B) The string Gaussian process ( zf)tei defined in (A) is C 1 in the if sense and its L 2 derivative is 
the process (z' t )tel defined in (A). 
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Proof See Appendix C. 


In our collaborative local GP experts analogy, Theorem 2 stipulates that each local expert takes 
as message from the previous expert its left hand side boundary conditions, conditional on which 
it generates its right hand side boundary conditions, which it then passes on to the next expert. 
Conditional on their boundary conditions local experts are independent of each other, and resemble 
vibrating pieces of string on fixed extremities, hence the name string Gaussian process. 

3.2 Pathwise Regularity 

Thus far we have dealt with regularity only in the L 2 sense. However, we note that a sufficient 
condition for the process ( z[)t^i in Theorem 2 to be almost surely continuous (i.e. sample paths are 
continuous with probability 1) and to be the almost sure derivative of the string Gaussian process 
( zt)t£i , is that the Gaussian processes on Ik = [a,k-i,ak] with mean and covariance functions 
m ck~ 1,ak anc * Kr’ ak ( as P er Eq ua ti° ns 3 and 4 with m := nip. and k := kp.) are themselves almost 
surely C 1 for every boundary condition. 6 We refer to (Adler and Taylor, 2011, Theorem 2.5.2) for a 
sufficient condition under which a C 1 in L 2 Gaussian process is also almost surely C 1 . As the above 
question is provably equivalent to that of the almost sure continuity of a Gaussian process (see Adler 
and Taylor, 2011, p. 30), Kolmogorov’s continuity theorem (see 0ksendal, 2003, Theorem 2.2.3) 
provides a more intuitive, albeit stronger, sufficient condition than that of (Adler and Taylor, 2011, 
Theorem 2.5.2). 

3.3 Illustration 

Algorithm 1 illustrates sampling jointly from a string Gaussian process and its derivative on an in¬ 
terval / = [oo, or-]. We start off by sampling the string boundary conditions (z ak , z' ak ) sequentially, 
conditional on which we sample the values of the stochastic process on each string. This we may 
do in parallel as the strings are independent of each other conditional on boundary conditions. The 
resulting time complexity is the sum of 0(max n\) for sampling values within strings, and O(n) 
for sampling boundary conditions, where the sample size is n = n^-. The memory requirement 

grows as the sum of n|,). required to store conditional covariance matrices of the values 

within strings, and O(K) corresponding to the storage of covariance matrices of boundary condi¬ 
tions. In the special case where strings are all empty, that is inputs and boundary times are the same, 
the resulting time complexity and memory requirement are 0(n). Figure 2 illustrates a sample from 
a string Gaussian process, drawn using this approach. 

3.4 String Gaussian Processes on W l 

So far the input space has been assumed to be an interval. We generalise string GPs to hyper¬ 
rectangles in M' / as stochastic processes of the form: 



( 10 ) 


where the link function f —> M is a C 1 function and (z 3 t ) are d independent (_L) latent string 

Gaussian processes on intervals. We will occasionally refer to string GPs indexed on with ri > 1 
as membrane GPs to avoid any ambiguity. We note that when d = 1 and when the link function 


6. The proof is provided in Appendix D. 
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0.0 0.2 0.4 0.6 0.8 1.0 


Figure 1: Draws from a conditional derivative GP conditioned to start at 0 with derivative 0 and 
to finish at 1.0 with derivative 0.0. The unconditional kernel is the squared exponential 
kernel with variance 1.0 and input scale 0.2. 



0.0 0.5 1.0 1.5 2.0 2.5 3.0 


Figure 2: Draw from a string GP ( zt) with 3 strings and its derivative under squared expo¬ 
nential kernels (green and yellow strings), and the periodic kernel of MacKay (1998) (red 
string). 
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Algorithm 1 Simulation of a string derivative Gaussian process 

Inputs: boundary times a o < ■ ■ ■ < ax, string times {tj G]afe_i, afc[}je[i..n fc ],fce[i...K> uncondi¬ 
tional mean (resp. covariance) functions rn k (resp. A:/,.) 

Output. {. . . , Z ak , Z ak , . . • , Zj.k , Z^k )•••}■ 

Step 1: sample the boundary conditions sequentially, 
for k = 0 to K do 

Sample (z ak ,z' ak ) ~ Af(^fi b k , E^, with and E| as per Equations (7) and (6). 

end for 

Step 2: sample the values on each string conditional on the boundary conditions in parallel. 
parfor k = 1 to K do 

Let and be as in Theorem 2, 


i-A = 




I’I£+fc -n, . £-IC//c ■n 

^ 1 'n k i a k — 1 ^ tn k i a k 


Mfc = 


fcMjfc 




+ fcA 


k^^-Clk — liQ'k—l k^^-dk — i : dk 

. fcKa fc ;a*,_i fcK afe;a; . 


Za k —i ^YlkiP'k— l) 

4., - 

. 4-^(4 J 


-i 








II 

W 

l 

T' ■ 

?r 

.+k 

n k’ n k J 

- fcA 

_fc K t* fc ;a fc _i 

A;K *S fe ;«fc_ 


-i T 


Sample ( ^, z' fc ..., ,4 ) ~ AA (/r|, E|). 

\ 1 n k J 


end parfor 


( 8 ) 


(9) 
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is (f>{x) = x, we recover string GPs indexed on an interval as previously defined. When the string 
GPs ( z 3 t ) are a.s. C 1 , the membrane GP f in Equation (10) is also a.s. C l , and the partial derivative 
with respect to the j -th coordinate reads: 



( 11 ) 


Thus in high dimensions, string GPs easily allow an explanation of the sensitivity of the learned 
latent function to inputs. 

3.5 Choice of Link Function 

Our extension of string GPs to R d departs from the standard GP paradigm in that we did not 
postulate a covariance function on R d x R d directly. Doing so usually requires using a metric on 
M d , which is often problematic for heterogeneous input dimensions, as it introduces an arbitrary 
comparison between distances in each input dimension. This problem has been partially addressed 
by approaches such as Automatic Relevance Determination (ARD) kernels, that allow for a lin¬ 
ear rescaling of input dimensions to be learned jointly with kernel hyper-parameters. However, 
inference under a string GP prior can be thought of as learning a coordinate system in which the 
latent function / resembles the link function f through non-linear rescaling of input dimensions. 
In particular, when f is symmetric, the learned univariate string GPs (being interchangeable in f) 
implicitly aim at normalizing input data across dimensions, making string GPs naturally cope with 
heterogeneous data sets. 

An important question arising from our extension is whether or not the link function o needs to 
be learned to achieve a flexible functional prior. The flexibility of a string GP as a functional prior 
depends on both the link function and the covariance structures of the underlying string GP building 
blocks (z(). To address the impact of the choice of f on flexibility, we constrain the string GP 
building blocks by restricting them to be independent identically distributed string GPs with one 
string each (i.e. ( z 3 t ) are i.i.d Gaussian processes). Furthermore, we restrict ourselves to isotropic 
kernels as they provide a consistent basis for putting the same covariance structure in M and R d . 
One question we might then ask, for a given link function d>o, is whether or not an isotropic GP 
indexed on R d with covariance function k yields more flexible random surfaces than the stationary 
string GP f(t\,... ,td) = ^o(a*i j • ■ •, z d ), where (zf.) are stationary GPs indexed on M with the 
same covariance function k. If we find a link function <po generating more flexible random surfaces 
than isotropic GP counterparts it would suggest f need not to be inferred in dimension d > 1 to 
be more flexible than any GP using one of the large number of commonly used isotropic kernels, 
among which squared exponential kernels, rational quadratic kernels, and Matern kernels to name 
but a few. 

Before discussing whether such a fo exists, we need to introduce a rigorous meaning to ‘flexi¬ 
bility’. An intuitive qualitative definition of the flexibility of a stochastic process indexed on R d is 
the ease with which it can generate surfaces with varying shapes from one random sample to another 
independent one. We recall that the tangent hyperplane to a C 1 surface y — f(x) = 0, x £ R d at 
some point xo = (t®,... ,t®) has equation V/(x o) T (x — xq) — (y — f(x o)) = 0 and admits as nor¬ 
mal vector (J^(ii), ■ ■ • • -)j d (ij) • — 1)- As tangent hyperplanes approximate a surface locally, a first 
criterion of flexibility for a random surface y — f(x) = 0, x <G W 1 is the proclivity of the (random) 
direction of its tangent hyperplane at any point x —and hence the proclivity of V/(x)—to vary. 
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This criterion alone, however, does not capture the difference between the local shapes of the ran¬ 
dom surface at two distinct points. A complementary second criterion of flexibility is the proclivity 
of the (random) directions of the tangent hyperplanes at any two distinct points xq, x\ E — and 
hence the proclivity of V/(xo) and V f(x ))—to be independent. The first criterion can be measured 
using the entropy of the gradient at a point, while the second criterion can be measured through the 
mutual information between the two gradients. The more flexible a stochastic process, the higher 
the entropy of its gradient at any point, and the lower the mutual information between its gradients 
at any two distinct points. This is formalised in the definition below. 

Definition 3 (Flexibility of stochastic processes) 

Let f and g be two real valued, almost surely C 1 stochastic processes indexed on and whose 
gradients have a finite entropy everywhere (i.e. V x, H(V f(x)), H(Vg(x)) < oo). We say that f 
is more flexible than g if the following conditions are met: 

1) Vx, H(Xf(x))>H(Xg(x)), 

2) Vxfy, J(V/(x); V/(y)) < I(V 5 (s); Xg(y)), 

where H is the entropy operator, and I(X ; Y) = H(X) + H(Y ) — H(X, Y ) stands for the mutual 
information between X and Y. 

The following proposition establishes that the link function fi s [x \,..., xf) = J2i=j x j yields more 
flexible stationary string GPs than their isotropic GP counterparts, thereby providing a theoretical 
underpinning for not inferring fi. 

Proposition 4 (Additively separable string GPs are flexible) 

Let k(x,y ) := p (||x — y\ || 2 ) be a stationary covariance function generating a.s. C 1 GP paths in¬ 
dexed on M d , d > 0, and pa function that is C 2 on ]0, +oo[ and continuous at 0. Let (j) s (x i,..., xf) = 
Y^)=i x j> l et ( z t )teP, je[i..d] b e independent stationary Gaussian processes with mean 0 and co- 
variance function k (where the L 2 norm is on M), and let f(t\, ... ,tf) = fi s (zl 1 , ..., zf ) be the 
corresponding stationary string GP. Finally, let g be an isotropic Gaussian process indexed on 
I 1 X • • • X I d with mean 0 and covariance function k (where the L 2 norm is on W 1 ). Then: 

1) V x E I 1 x ■ ■ ■ x I d , H(Xf(x)) = H(Xg(x)), 

2) V x + y E I 1 x ■ ■ ■ x I d , I(V/(x); Vf(y)) < I(Xg(x)-Xg(y)). 

Proof See Appendix E. ■ 

Although the link function need not be inferred in a full nonparametric fashion to yield compara¬ 
ble if not better results than most isotropic kernels used in the standard GP paradigm, for some 
problems certain link functions might outperform others. In Section 4.2 we analyse a broad family 
of link functions, and argue that they extend successful anisotropic approaches such as the Au¬ 
tomatic Relevance Determination (MacKay (1998)) and the additive kernels of Duvenaud et al. 
(2011). Moreover, in Section 5 we propose a scalable inference scheme applicable to any link 
function. 

4. Comparison with the Standard GP Paradigm 

We have already established that sampling string GPs scales better than sampling GPs under the 
standard GP paradigm and is amenable to distributed computing. We have also established that 
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stationary additively separable string GPs are more flexible than their - isotropic counterparts in the 
standard GP paradigm. In this section, we provide further theoretical results relating the string 
GP paradigm to the standard GP paradigm. Firstly we establish that string GPs with link function 
</!>., (x\.... , Xfi) = Y% = j Xj are GPs. Secondly, we derive the global mean and covariance functions 
induced by the string GP construction for a variety of link functions. Thirdly, we provide a sense 
in which the string GP paradigm can be thought of as extending the standard GP paradigm. And 
finally, we show that the string GP paradigm may serve as a scalable approximation of commonly 
used stationary kernels. 

4.1 Some String GPs are GPs 

On one hand we note from Theorem 2 that the restriction of a string GP defined on an interval to 
the support of the first string—in other words the first local GP expert—is a Gaussian process. On 
the other hand, the messages passed on from one local GP expert to the next are not necessarily 
consistent with the unconditional law of the receiving local expert, so that overall a string GP 
defined on an interval, that is when looked at globally and unconditionally, might not be a Gaussian 
process. However, the following proposition establishes that some string GPs are indeed Gaussian 
processes. 

Proposition 5 (Additively separable string GPs are GPs) 

String Gaussian processes on M are Gaussian processes. Moreover, string Gaussian processes on 
with link function <j> s {x i,..., xf) = x j are a ^ so Gaussian processes. 

Proof The intuition behind this proof lies in the fact that if X is a multivariate Gaussian, and if 
conditional on X, Y is a multivariate Gaussian, providing that the conditional mean of Y depends 
linearly on X and the conditional covariance matrix of Y does not depend on X, the vector (X, Y) 
is jointly Gaussian. This will indeed be the case for our collaboration of local GP experts as the 
boundary conditions picked up by an expert from the previous will not influence the conditional co- 
variance structure of the expert (the conditional covariance strucuture depends only on the partition 
of the domain, not the values of the boundary conditions) and will affect the mean linearly. See 
Appendix H for the full proof. ■ 


The above result guarantees that commonly used closed form predictive equations under GP 
priors are still applicable under some string GP priors, providing the global mean and covariance 
functions, which we derive in the following section, are available. Proposition 5 also guarantees 
stability of the corresponding string GPs in the GP family under addition of independent Gaussian 
noise terms as in regression settings. Moreover, it follows from Proposition 5 that inference tech¬ 
niques developed for Gaussian processes can be readily used under string GP priors. In Section 5 
we provide an additional MCMC scheme that exploits the conditional independence between strings 
to yield greater scalability and distributed inference. 

4.2 String GP Kernels and String GP Mean Functions 

The approach we have adopted in the construction of string GPs and membrane GPs did not re¬ 
quire explicitly postulating a global mean function or covariance function. In Appendix I we derive 
the global mean and covariance functions that result from our construction. The global covariance 
function could be used for instance as a stand-alone kernel in any kernel method, for instance GP 
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models under the standard GP paradigm, which would provide a flexible and nonstationary al¬ 
ternative to commonly used kernels that may be used to learn local patterns in data sets—some 
successful example applications are provided in Section 5. That being said, adopting such a global 
approach should be limited to small scale problems as the conditional independence structure of 
string GPs does not easily translate into structures in covariance matrices over string GP values 
(without derivative information) that can be exploited to speed-up SVD or Cholesky decomposi¬ 
tion. Crucially, marginalising out all derivative information in the distribution of derivative string 
GP values at some inputs would destroy any conditional independence structure, thereby limiting 
opportunities for scalable inference. In Section 5 we will provide a RJ-MCMC inference scheme 
that fully exploits the conditional independence structure in string GPs and scales to very large data 
sets. 


4.3 Connection Between Multivariate String GP Kernels and Existing Approaches 

We recall that for n < d, the n-th order elementary symmetric polynomial (Macdonald (1995)) is 
given by 


n 

eo{xi,... ,x d ) := 1, VI < n < d e n (xi,... ,x d ) = ^ ( 12 ) 

t<jl<32<—<jn<d k= 1 


As an illustration, 


d 

ei(xi, ...,x d ) = y^Xj = (j> s (x i,.. .,x d ), 

3 =1 


e 2 (xi, ...,x d ) = x\x 2 + xix 3 -I-b xix d H-h x d _ x x d , 


d 

e d (x x , • • -,x d ) = JJxj = <j) p (x i,.. .,x d ). 

3 = 1 

Covariance kernels of string GPs, using as link functions elementary symmetric polynomials e n , 
extend most popular approaches that combine unidimensional kernels over features for greater flex¬ 
ibility or cheaper design experiments. 

The first-order polynomial e x gives rise to additively separable Gaussian processes, that can be 
regarded as Bayesian nonparametric generalised additive models (GAM), particularly popular for 
their interpretability. Moreover, as noted by Durrande et al. (2012), additively separable Gaussian 
processes are considerably cheaper than alternate transformations in design experiments with high¬ 
dimensional input spaces. In addition to the above, additively separable string GPs also allow 
postulating the existence of local properties in the experimental design process at no extra cost. 

The d- th order polynomial e d corresponds to a product of unidimensional kernels, also known 
as separable kernels. For instance, the popular squared exponential kernel is separable. Separable 
kernels have been successfully used on large scale inference problems where the inputs form a grid 
(Saatchi, 2011; Wilson et ah, 2014), as they yield covariance matrices that are Kronecker products, 
leading to maximum likelihood inference in linear time complexity and with linear memory require¬ 
ment. Separable kernels are often used in conjunction with the automatic relevance determination 
(ARD) model, to learn the relevance of features through global linear rescaling. However, ARD 
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kernels might be limited in that we might want the relevance of a feature to depend on its value. As 
an illustration, the market value of a watch can be expected to be a stronger indicator of its owner’s 
wealth when it is in the top 1 percentile, than when it is in the bottom 1 percentile; the rationale 
being that possessing a luxurious watch is an indication that one can afford it, whereas possessing 
a cheap watch might be either an indication of lifestyle or an indication that one cannot afford a 
more expensive one. Separable string GP kernels extend ARD kernels, in that strings between input 
dimensions and within an input dimension may have unconditional kernels with different hyper¬ 
parameters, and possibly different functional forms, thereby allowing for automatic local relevance 
determination (ALRD). 

More generally, using as link function the n-th order elementary symmetric polynomial e n 
corresponds to the n-th order interaction of the additive kernels of Duvenaud et al. (2011). We 
also note that the class of link functions 4){x\. ..., xf) = Ylt =t a i e i{ x U • • • ■ x <i) yield full additive 
kernels. Duvenaud et al. (2011) noted that such kernels are ‘exceptionally well-suited’ to learn non¬ 
local structures in data. String GPs complement additive kernels by allowing them to learn local 
structures as well. 

4.4 String GPs as Extension of the Standard GP Paradigm 

The following proposition provides a perspective from which string GPs may be considered as 
extending Gaussian processes on an interval. 

Proposition 6 ( Extension of the standard GP paradigm) 

Let K G N*, let I = [ao, ax] and I k = [a*_i, a k ] be intervals with ao < ■ ■ ■ < clk- Furthermore, 
let m : I —>• M be a C 1 function, rrik the restriction of m to Ik, h : I X / —> M a C 3 symmetric 
positive semi-definite function, and hr the restriction ofh to I k x If 

(zfjtei ~ SQV({a k }, { m k }, {hi J), 

then 

Vke[l..K], (z t ) teIk ~ gV(m,h). 


Proof See Appendix F. 


We refer to the case where unconditional string mean and kernel functions are restrictions of 
the same functions as in Proposition 6 as uniform string GPs. Although uniform string GPs arc 
not guaranteed to be as much regular at boundary times as their counterparts in the standard GP 
paradigm, we would like to stress that they may well generate paths that are. In other words, the 
functional space induced by a uniform string GP on an interval extends the functional space of the 
GP with the same mean and covariance functions m and h taken globally and unconditionally on 
the whole interval as in the standard GP paradigm. This allows for (but does not enforce) less 
regularity at the boundary times. When string GPs are used as functional prior, the posterior mean 
can in fact have more regularity at the boundary times than the continuous differentiability enforced 
in the string GP paradigm, providing such regularity is evidenced in the data. 

We note from Proposition 6 that when m is constant and h is stationary, the restriction of the 
uniform string GP ( zt)tei to any interval whose interior does not contain a boundary time, the 
largest of which being the intervals [a k ~ i, a k \, is a stationary GP. We refer to such cases as partition 
stationary string GPs. 
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4.5 Commonly Used Covariance Functions and their String GP Counterparts 

Considering the superior scalability of the string GP paradigm, which we may anticipate from the 
scalability of sampling string GPs, and which we will confirm empirically in Section 5, a natural 
question that comes to mind is whether or not kernels commonly used in the standard GP paradigm 
can be well approximated by string GP kernels, so as to take advantage of the improved scalability 
of the string GP paradigm. We examine the distortions to commonly used covariance structures 
resulting from restricting strings to share only C 1 boundary conditions, and from increasing the 
number of strings. 

Figure 3 compares some popular stationary kernels on [0,1] X [0,1] (first column) to their uni¬ 
form string GP kernel counterparts with 2, 4, 8 and 16 strings of equal length. The popular ker¬ 
nels considered are the squared exponential kernel (SE), the rational quadratic kernel kRQ(u, v ) = 

^1 + j with a = 1 (RQ 1) and a = 5 (RQ 5), the Matern 3/2 kernel (MA 3/2), and the 

Matern 5/2 kernel (MA 5/2), each with output scale (variance) 1 and input scale 0.5. Firstly, we ob¬ 
serve that each of the popular kernels considered coincides with its uniform string GP counterparts 
regardless of the number of strings, so long as the arguments of the covariance function are less 
than an input scale apart. Except for the Matern 3/2, the loss of information induced by restricting 
strings to share only C 1 boundary conditions becomes noticeable when the arguments of the co- 
variance function are more than 1.5 input scales apart, and the effect is amplified as the number of 
strings increases. As for the Matern 3/2, no loss of information can been noticed, as further attests 
Table 1. In fact, this comes as no surprise given that stationary Matern 3/2 GP are 1-Markov, that is 
the corresponding derivative Gaussian process is a Markov process so that the vector ( zt , z' t ) con¬ 
tains as much information as all string GP or derivative values prior to t (see Doob (1944)). Table 1 
provides some statistics on the absolute errors between each of the popular kernels considered and 
uniform string GP counterparts. 
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Figure 3: Commonly used covariance functions on [0,1] X [0,1] with the same input and output 
scales (first column) and their uniform string GP counterparts with K > 1 strings of 
equal length. 
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5. Inference under String and Membrane GP Priors 

In this section we move on to developing inference techniques for Bayesian nonparametric inference 
of latent functions under string GP priors. We begin with marginal likelihood inference in regression 
problems. We then propose a novel reversible-jump MCMC sampler that enables automatic learning 
of model complexity (that is the number of different unconditional kernel configurations) from the 
data, with a time complexity and memory requirement both linear in the number of training inputs. 

5.1 Maximum Marginal Likelihood for Small Scale Regression Problems 

Firstly, we leverage the fact that additively separable string GPs are Gaussian processes to perform 
Bayesian nonparametric regressions in the presence of local patterns in the data, using standard 
Gaussian process techniques (see Rasmussen and Williams, 2006, p.l 12 §5.4.1). We use as genera¬ 
tive model 


Vi = f(xi) + G, €i ~ AT (0, erf.) , erf. > 0, Xi <E I 1 x • • • x I d , y h qGR 

we are given the training data set V = { x r , ?/,;}f i.. ,?v] • and we place a mean-zero additively separa¬ 
ble string GP prior on /, namely 

d 

/(®) = Y z x\jv (4) ~ S< J V {°}> i k l}) > < i, (4) - 1 ( g )> 

3 =i 


which we assume to be independent of the measurement noise process. Moreover, the noise terms 
are assumed to be independent, and the noise variance erf affecting f(xi) is assumed to be the 
same for any two inputs whose coordinates lie on the same string intervals. Such a heteroskedastic 
noise model fits nicely within the string GP paradigm, can be very useful when the dimension of 
the input space is small, and may be replaced by the typical constant noise variance assumption in 
high-dimensional input spaces. 

Let us define y = (yi,... ,y N ), X = (xi,.. .,x N ), f = (f(x i),. . .,f(x N )) and let K X; x de¬ 
note the auto-covariance matrix of f (which we have derived in Section 4.2), and let D = diag({crf,}) 
denote the diagonal matrix of noise variances. It follows that y is a Gaussian vector with mean 0 
and auto-covariance matrix K y := K X; x + D and that the log marginal likelihood reads: 


logp y 


x ,Wki},{0l}’{ a i}) = -7,y 7 K v 'y - \ logdet (Ky) - ^log2vr. 


(13) 


We obtain estimates of the string measurement noise standard deviations and estimates of the 
string hyper-parameters {6 J k } by maximising the marginal likelihood for a given domain partition 
{a \}, using gradient-based methods. We deduce the predictive mean and covariance matrix of the 
latent function values f* at test points X*, from the estimates { 0 J k }, {<t/ C i } as 

E(r|y) = K x . ;X K y 1 y and cov(f |y) = K X * ;X * - Kx^xK^ 1 ^*., (14) 


using the fact that (f*, y) is jointly Gaussian, and that the cross-covariance matrix between f* and 
y is K x * ; x as the additive measurement noise is assumed to be independent from the latent process /. 
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5.1.1 Remarks 

The above analysis and equations still hold when a GP prior is placed on / with one of the multi¬ 
variate string GP kernels derived in Section 4.2 as covariance function. 

It is also worth noting from the derivation of string GP kernels in Appendix I that the marginal 
likelihood Equation (13) is continuously differentiable in the locations of boundary times. Thus, for 
a given number of boundary times, the positions of the boundary times can be determined as part 
of the marginal likelihood maximisation. The derivatives of the marginal log-likelihood (Equation 
13) with respect to the aforementioned locations { aj : } can be determined from the recursions of 
Appendix I, or approximated numerically by finite differences. The number of boundary times 
in each input dimension can then be learned by trading off model fit (the maximum marginal log 
likelihood) and model simplicity (the number of boundary times or model parameters), for instance 
using information criteria such as AIC and BIC. When the input dimension is large, it might be 
advantageous to further constrain the hypothesis space of boundary times before using information 
criteria, for instance by assuming that the number of boundary times is the same in each dimension. 
An alternative Bayesian nonparametric approach to learning the number of boundary times will be 
discussed in section 5.4. 

This method of inference cannot exploit the structure of string GPs to speed-up inference, and as 
a result it scales like the standard GP paradigm. In fact, any attempt to marginalize out univariate 
derivative processes, including in the prior, will inevitably destroy the conditional independence 
structure. Another perspective to this observation is found by noting from the derivation of global 
string GP covariance functions in Appendix I that the conditional independence structure does not 
easily translate in a matrix structure that may be exploited to speed-up matrix inversion, and that 
marginalizing out terms relating to derivatives processes as in Equation (13) can only make things 
worse. 

5.2 Generic Reversible-Jump MCMC Sampler for Large Scale Inference 

More generally, we consider learning a smooth real-valued latent function /, defined on a d- 
dimensional hyper-rectangle, under a generative model with likelihood p (X>|f, u), where f denotes 
values of / at training inputs points and u denotes other likelihood parameters that are not re¬ 
lated to /. A large class of machine learning problems aiming at inferring a latent function have a 
likelihood model of this form. Examples include celebrated applications such as nonparametric re¬ 
gression and nonparametric binary classification problems, but also more recent applications such as 
learning a profitable portfolio generating-function in stochastic portfolio theory (Karatzas and Fern- 
holz (2009)) from the data. In particular, we do not assume that p(V |f, u) factorizes over training 
inputs. Extensions to likelihood models that depend on the values of multiple latent functions are 
straight-forward and will be discussed in Section 5.3. 

5.2.1 Prior Specification 

We place a prior p( u) on other likelihood parameters. For instance, in regression problems under 
a Gaussian noise model, u can be the noise variance and we may choose p( u) to be the inverse- 
Gamma distribution for conjugacy. We place a mean-zero string GP prior on / 



22 


String and Membrane Gaussian Processes 


As discussed in Section 3.5, the link function <f> need not be inferred as the symmetric sum was 
found to yield a sufficiently flexible functional prior. Nonetheless, in this section we do not impose 
any restriction on the link function 4> other than continuous differentiability. Denoting z the vector 
of univariate string GP processes and their derivatives, evaluated at all distinct input coordinate 
values, we may re-parametrize the likelihood as p(T> |z,u), with the understanding that f can be 
recovered from z through the link function cf>. To complete our prior specification, we need to 
discuss the choice of boundary times { u J k } and the choice of the corresponding unconditional kernel 
structures { kj ,}. Before doing so, we would like to stress that key requirements of our sampler are 
that i) it should decouple the need for scalability from the need for flexibility, ii) it should scale 
linearly with the number of training and test inputs, and iii) the user should be able to express 
prior views on model complexity/flexibility in an intuitive way, but the sampler should be able to 
validate or invalidate the prior model complexity from the data. While the motivations for the last 
two requirements are obvious, the first requirement is motivated by the fact that a massive data set 
may well be more homogeneous than a much smaller data set. 

5.2.2 Scalable Choice of Boundary Times 

To motivate our choice of boundary times that achieves great scalability, we first note that the 
evaluation of the likelihood, which will naturally be needed by the MCMC sampler, will typically 
have at least linear time complexity and linear memory requirement, as it will require performing 
computations that use each training sample at least once. Thus, the best we can hope to achieve 
overall is linear time complexity and linear memory requirement. Second, in MCMC schemes with 
functional priors, the time complexity and memory requirements for sampling from the posterior 

p(f|D) ocp(D|f)p(f) 

are often the same as the resource requirements for sampling from the prior p( f), as evaluating 
the model likelihood is rarely the bottleneck. Finally, we note from Algorithm 1 that, when each 
input coordinate in each dimension is a boundary time, the sampling scheme has time complexity 
and memory requirement that are linear in the maximum number of unique input coordinates across 
dimensions, which is at most the number of training samples. In effect, each univariate derivative 
string GP is sampled in parallel at as many times as there are unique input coordinates in that di¬ 
mension, before being combined through the link function. In a given input dimension, univariate 
derivative string GP values are sampled sequentially, one boundary time conditional on the previ¬ 
ous. The foregoing sampling operation is very scalable not only asymptotically but also in absolute 
terms; it merely requires storing and inverting at most as many 2x2 matrices as the number of 
input points. We will evaluate the actual overall time complexity and memory requirement when 
we discuss our MCMC sampler in greater details. For now, we would like to stress that i) choosing 
each distinct input coordinate value as a boundary time in the corresponding input dimension before 
training is a perfectly valid choice, ii) we expect this choice to result in resource requirements that 
grow linearly with the sample size and iii) in the string GP theory we have developed thus far there 
is no requirement that two adjacent strings be driven by different kernel hyper-parameters. 

5.2.3 Model Complexity Learning as a Change-Point Problem 

The remark iii) above pertains to model complexity. In the simplest case, all strings are driven by the 
same kernel and hyper-parameters as it was the case in Section 4.5, where we discussed how this 
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setup departs from postulating the unconditional string covariance function kj, globally similarly 
to the standard GP paradigm. The more distinct unconditional covariance structures there are, 
the more complex the model is, as it may account for more types of local patterns. Thus, we may 
identify model complexity to the number of different kernel configurations across input dimensions. 
In order to learn model complexity, we require that some (but not necessarily all) strings share their 
kernel configuration. 7 Moreover, we require kernel membership to be dimension-specific in that 
two strings in different input dimensions may not explicitly share a kernel configuration in the prior 
specification, although the posterior distribution over their hyper-parameters might be similar if the 
data support it. 

In each input dimension j, kernel membership is defined by a partition of the corresponding 
domain operated by a (possibly empty) set of change-points, 8 as illustrated in Figure 4. When there 
is no change-point as in Figure 4-(a), all strings are driven by the same kernel and hyper-parameters. 
Each change-point Cp induces a new kernel configuration Oj, that is shared by all strings whose 
boundary times a 3 k and a k +1 both lie in [cp,c^ +1 [. When one or multiple change-points cjp occur 
between two adjacent boundary times as illustrated in Figures 4-(b-d), for instance a{ < c 7 < a{ +v 
the kernel configuration of the string defined on [aj. , a 3 k , } ] is that of the largest change-point that lies 
in [a k , a^ +1 ] (see for instance Figure 4-(d)). For consistency, we denote 0 : { l the kernel configuration 
driving the first string in the j- th dimension; it also drives strings that come before the first change- 
point, and all strings when there is no change-point. 

To place a prior on model complexity, it suffices to define a joint probability measure on the 
set of change-points and the corresponding kernel configurations. As kernel configurations are 
not shared across input dimensions, we choose these priors to be independent across input dimen¬ 
sions. Moreover, {cp} being a random collection of points on an interval whose number and po¬ 
sitions are both random, it is de facto a point process (Daley and Vere-Jones (2008)). To keep the 
prior specification of change-points uninformative, it is desirable that conditional on the number of 
change-points, the positions of change-points be i.i.d. uniform on the domain. As for the number 
of change-points, it is important that the support of its distribution not be bounded, so as to allow 
for an arbitrarily large model complexity if warranted. The two requirements above are satisfied by 
a homogeneous Poisson process or HPP (Daley and Vere-Jones (2008)) with constant intensity AT 
More precisely, the prior probability measure on Oj ,}, A J j is constructed as follows: 

'X> ~ r (a?,P), 

< {4} | A 7 ~ HPP(AJ) 

Op[i]\{ 4>}A 3 ~ d logAA(0,/P) 

y(j,p) + (l,q) Op -L 0 l q , 

where we choose the Gamma distribution F as prior on the intensity A- 7 for conjugacy, we assume all 
kernel hyper-parameters are positive as is often the case in practice, 9 the coordinates of the hyper¬ 
parameters of a kernel configuration are assumed i.i.d., and kernel hyper-parameters are assumed 

7. That is, the functional form of the unconditional kernel k J k and its hyper-parameters. 

8. We would like to stress that change-points do not introduce new input points or boundary times, but solely define a 
partition of the domain of each input dimension. 

9. This may easily be relaxed if needed, for instance by putting normal priors on parameters that may be negative and 
log-normal priors on positive parameters. 
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(a) 



Figure 4: Effects of domain partition through change-points (coloured circles), on kernel member¬ 
ship. Each vertical bar corresponds to a distinct boundary time a 3 k . For the same collection 
of boundary times, we consider four scenarios: (a) no partition, (b) partition of the domain 
in two by a single change-point that does not coincide with any existing boundary time, 
(c) partition of the domain in three by two change-points, one of which coincides with 
an existing boundary time, and (d) partition of the domain in two by two distinct change- 
points. In each scenario, kernel membership is illustrated by colour-coding. The colour 
of the interval between two consecutive boundary times a k and a? k+l reflects what ker¬ 
nel configuration drives the corresponding string; in particular, the colour of the vertical 
bar corresponding to boundary time a k+1 determines what kernel configuration should 
be used to compute the conditional distribution of the value of the derivative string GP 
(z 3 t ,z 3 t ) at a 3 k+v given its value at a k . 
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independent between kernel configurations. Denoting the domain of the j-th input [a- 7 , IP], it follows 
from applying the laws of total expectation and total variance on Equation (16) that the expected 
number of change-points in the j-th dimension under our prior is 

/yj 

E(#{c p }) = (V-a?)p, (17) 

and the variance of the number of change-points in the j-dimension under our prior is 

Var(#{<7}) = (y-a7)^l + fc/lj. (18) 

The two equations above may guide the user when setting the parameters cr 7 and . For instance, 
these values may be set so that the expected number of change-points in a given input dimension be a 
fixed fraction of the number of boundary times in that input dimension, and so that the prior variance 
over the number of change-points be large enough that overall the prior isn’t too informative. 

We could have taken a different approach to construct our prior on change-points. In effect, 
assuming for the sake of the argument that the boundaries of the domain of the j-th input, namely 
a- 7 and IP, are the first and last change-point in that input dimension, we note that the mapping 

( c 7 ) -> ( p> ) ■=( 4+1-^ 

Cp,.. . I..., b j_ a J 

defines a bijection between the set of possible change-points in the j-th dimension and the set 
of all discrete probability distributions. Thus, we could have placed as prior on ^... .pj,, ... j a 
Dirichlet process (Ferguson (1973)), a Pitman-Yor process (Pitman and Yor (1997)), more generally 
normalized completely random measures (Kingman (1967)) or any other probability distribution 
over partitions. We prefer the point process approach primarily because it provides an easier way 
of expressing prior belief about model complexity through the expected number of change-points 
#{cp}, while remaining uninformative about positions thereof. 

One might also be tempted to regard change-points in an input dimension j as inducing a par¬ 
tition, not of the domain [a- 7 , IP], but of the set of boundary times aj, in the same dimension, so 
that one may define a prior over kernel memberships through a prior over partitions of the set of 
boundary times. However, this approach would be inconsistent with the aim to learn local patterns 
in the data if the corresponding random measure is exchangeable. In effect, as boundary times are 
all input coordinates, local patterns may only arise in the data as a result of adjacent strings sharing 
kernel configurations. An exchangeable random measure would postulate a priori that two kernel 
membership assignments that have the same kernel configurations (i.e. the same number of configu¬ 
rations and the same set of hyper-parameters) and the same number of boundary times in each kernel 
cluster (although not exactly the same boundary times), are equally likely to occur, thereby possibly 
putting more probability mass on kernel membership assignments that do not respect boundary time 
adjacency. Unfortunately, exchangeable random measures (among which the Dirichlet process and 
the Pitman-Yor process) are by far more widely adopted by the machine learning community than 
non-exchangeable random measures. Thus, this approach might be perceived as overly complex. 
That being said, as noted by Foti and Williamson (2015), non-exchangeable normalized random 
measures may be regarded as Poisson point processes (with varying intensity functions) on some 
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augmented spaces, which makes this choice of prior specification somewhat similar, but stronger 
(that is more informative) than the one we adopt in this paper. 

Before deriving the sampling algorithm, it is worth noting that the prior defined in Equation (16) 
does not admit a density with respect to the same base measure, 10 as the number of change-points 
#{qp}, and subsequently the number of kernel configurations, may vary from one sample to another. 
Nevertheless, the joint distribution over the data V and all other model parameters is well defined 
and, as we will see later, we may leverage reversible-jump MCMC techniques (Green (1995); Green 
and Hastie (2009)) to construct a Markov chain that converges to the posterior distribution. 

5.2.4 Overall Structure op the MCMC Sampler 

To ease notations, we denote c the set of all change-points in all input dimensions, we denote 
n = ^..., #{cp},... ^ G the vector of the numbers of change-points in each input dimension, 

we denote 6 the set of kernel hyper-parameters, 11 and p :=(..., p 7 ,...) the vector of variances of 
the independent log-normal priors on 6. We denote A :=(..., A- 7 ,...) the vector of change-points 
intensities, we denote a :=(..., a- 7 ,...) and (3 :=(..., ft ,...) the vectors of parameters of the 
Gamma priors we placed on the change-points intensities across the d input dimensions, and we 
recall that u denotes the vector of likelihood parameters other than the values of the latent function 
/■ 

We would like to sample from the posterior distribution p( f, f*, Vf, Vf* \T>, a, (3, p), where f 
and f* are the vectors of values of the latent function / at training and test inputs respectively, and 
Vf, Vf* the corresponding gradients. Denoting z the vector of univariate string GP processes and 
their derivatives, evaluated at all distinct training and test input coordinate values, we note that to 
sample from p( f, f*, Vf, Vf*|D, a, (3 , p), it suffices to sample from p(z\V, a , /3, p), compute f 
and f* using the link function, and compute the gradients using Equation (11). To sample from 
p(z\V, a, (3 , p), we may sample from the target distribution 

7r(n, c, 6 , A, z, u) := p(n, c, 6 , A, z, u|D, a, /3, p), (19) 

and discard variables that are not of interest. As previously discussed, it is not absolutely continuous 
with respect to the same base measure, though we may still decompose it as 

vr(n, c, 6 , A, z, u) = — -p(n\X)p(X\a, (3)p(c\n)p(0\n, p)p(u)p(z\c, 0)p{V |z, u), 

p(V\a,/3,p) 

(20) 

where we use the notation p(.) and />(.[.) to denote probability measures rather than probability 
density functions or probability mass functions, and where product and scaling operations are usual 
measure operations. Before proceeding any further, we will introduce a slight re-parametrization of 
Equation (20) that will improve the inference scheme. 

Let n a = (..., ...) be the vector of the numbers of unique boundary times in all d 

input dimensions. We recall from our prior on / that 



10. That is the joint prior probability measure is neither discrete, nor continuous. 

11. To simplify the exposition, we assume without loss of generality that each kernel configuration has the same kernel 
functional form, so that configurations are defined by kernel hyper-parameters. 
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where each factor in the decomposition above is a bivariate Gaussian density whose mean vector and 
covariance matrix is obtained from the partitions c, the kernel hyper-parameters 0, and the kernel 
membership scheme described in Section 5.2.3 and illustrated in Figure 4, and using Equations 
(6-7). Let be the unconditional covariance matrix between (^zi, zi'^J and (zi, z{,'^j as per 

the unconditional kernel structure driving the string defined on the interval [a 3 k . a 3 k+ { [. Let Eg := 
,(K , j be the auto-covariance matrix of ( z 3 ,, z 3 \ ). Let 

0 a >0 \ a?' a J J 
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(23) 


where {x 3 k } are independent bivariate standard normal vectors. Equations (22-23) provide an equiv¬ 
alent representation. In effect, we recall that if Z = M + LX, where X ~ A/"(0, 1) is a standard 
multivariate Gaussian, M is a real vector, and L is a real matrix, then Z ~ A f(M, LL r ). Equations 

z 3 , , z 3 ',. I. We note 

a i-i a l -1 / 

that at training time, M 3 k and L 3 k only depend on kernel hyper-parameters. Denoting x the vector of 
all x 3 ,, x is a so-called ‘whitened’ representation of z, which we prefer for reasons we will discuss 
shortly. In the whitened representation, the target distribution it is re-parameterized as 

vr(n, c, 6 , A, x, u) = -p(n\X)p{X\a, ( 3 )p(c\n)p( 0 \n, p)p(u)p(x)p(£>|x, c, 6 , u), 

p(V\a,( 3 ,p) 

(24) 

where the dependency of the likelihood term on the partitions and the hyper-parameters stems from 
the need to recover z and subsequently f from x through Equations (22) and (23). The whitened 
representation Equation (24) has two primary advantages. Firstly, it is robust to ill-conditioning of 


(22-23) result from applying this result to 




and 
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which would typically occur when two adjacent boundary times are too close to each other. In 
the representation of Equation (20), as one needs to evaluate the density p(z|c. 0), ill-conditioning 
of Y? k would result in numerical instabilities. In contrast, in the whitened representation, one needs 
to evaluate the density p(x), which is that of i.i.d. standard Gaussians and as such can be evaluated 
robustly. Moreover, the SVD required to evaluate L 3 k is also robust to ill-conditioning of Y? k , so 
that Equations (22) and (23) hold and can be robustly evaluated for degenerate Gaussians too. The 
second advantage of the whitened representation is that it improves mixing by establishing a link 
between kernel hyper-parameters and the likelihood. 

Equation (24) allows us to cast our inference problem as a Bayesian model selection problem 
under a countable family of models indexed by n £ N' / , each defined on a different parameter 
subspace C n , with cross-model normalizing constant p(V\cx,f3, p), model probability driven by 
p(n\X)p(X\a, (3), model-specific prior p(c\n)p(6\n, p)p(u)p(x), and likelihood p(V |x, c, 0, u). 
Critically, it can be seen from Equation (24) that the conditional probability distribution 

7 r(c, 0, A, x, u| n) 

admits a density with respect to Lebesgue’s measure on C n . 

Our setup is therefore analogous to that which motivated the seminal paper Green (1995), so 
that to sample from the posterior 7r(c, 0, A, x, u, n) we may use any Reversible-Jump Metropolis- 
Hastings (RJ-MH) scheme satisfying detailed balance and dimension-matching as described in sec¬ 
tion 3.3 of Green (1995). To improve mixing of the Markov chain, we will alternate between 
a between-models RJ-MH update with target distribution 7r(n, c, 0, A, x, u), and a within-model 
MCMC-within-Gibbs sampler with target distribution 7r(c, 0, A, x, u|n). Constructing reversible- 
jump samplers by alternating between within-model sampling and between-models sampling is 
standard practice, and it is well-known that doing so yields a Markov chain that converges to the 
target distribution of interest (see Brooks et ah, 201 1, p. 50). 

In a slight abuse of notation, in the following we might use the notations p(.|.) and />(.), which 
we previously used to denote probability measures, to refer to the corresponding probability density 
functions or probability mass functions. 

5.2.5 Within-ModelUpdates 

We recall from Equation (24) that c, 0, A, x, u\T>, a, /3, p, n has probability density function 

p(n\X)p(X\a, f3)p(c\n)p(G\n 1 p)p(u)p(x)p(V\x, c, 0, u), (25) 

up to a normalizing constant. 

Updating A: By independence of the priors over (A[j],n[y]), the distributions A[j] | n[j] are 
also independent, so that the updates may be performed in parallel. Moreover, recalling that the 
prior number of change-points in the j-th input dimension is Poisson distributed with intensity 
X{j] (pi — a 3 ), and by conjugacy of the Gamma distribution to the Poisson likelihood, it follows 
that 

A b11 n \j] ~ r + a b1> 1 + P\j]) ■ ( 26 ) 

This update step has memory requirement and time complexity both constant in the number of 
training and test samples. 
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Updating u: When the likelihood has additional parameters u, they may be updated with a 
Metropolis-Hastings step. Denoting q( u —> u') the proposal probability density function, the ac¬ 
ceptance ratio reads 


/ p(u')p (Pjx, c, 0, iT) q{u' -A u) \ 

V ’ p( u )p (P\*, c, 0, u) q( u -> u') y ' 


(27) 


In some cases however, it might be possible and more convenient to choose p(u) to be conjugate to 
the likelihood p (V\x, c, 0, u). For instance, in regression problems under a Gaussian noise model, 
we may take u to be the noise variance on which we may place an inverse-gamma prior. Either way, 
the computational bottleneck of this step is the evaluation of the likelihood p (V\x, c, 0, u'), which 
in most cases can be done with a time complexity and memory requirement that are both linear in 
the number of training samples. 

Updating c: We update the positions of change-points sequentially using the Metropolis- 
Hastings algorithm, one input dimension j at a time, and for each input dimension we proceed 
in increasing order of change-points. The proposal new position for the change-point Cp is sampled 
uniformly at random on the interval c^ +1 [, where c? p _± (resp. c p ,) is replaced by a J (resp. 
b>) for the first (resp. last) change-point. The acceptance probability of this proposal is easily found 
to be 


r j = min 

Cp 


V ’ p(V |x,c,0,u) ) 


(28) 


where c' is identical to c except for the change-point to update. This step requires computing 
the factors {L{, Ml} corresponding to inputs in j-th dimension whose kernel configuration would 
change if the proposal were to be accepted, the corresponding vector of derivative string GP values 
z, and the observation likelihood under the proposal p (D|x, c 7 , 9, u). The computational bottleneck 
of this step is therefore once again the evaluation of the new likelihood p (P|x. <7. 6 , u). 

Updating x: The target conditional density of x is proportional to 


p(x)p(D|x,c,0,u). (29) 

Recalling that p(x) is a multivariate standard normal, it follows that the form of Equation (29) 
makes it convenient to use elliptical slice sampling (Murray et al. (2010)) to sample from the un- 
ormalized conditional p(x)p(V |x, c, 6, u). The two bottlenecks of this update step are sampling a 
new proposal from p(x) and evaluating the likelihood p(V |x, c, 0, u). Sampling from the multi¬ 
variate standard normal p(x) may be massively parallelized, for instance by using GPU Gaussian 
random number generators. When no parallelism is available, the overall time complexity reads 
° (zU n a[j]), where we recall that n a [j] denotes the number of distinct training and testing 
input coordinates in the j-th dimension. In particular, if we denote N the total number of training 
and testing d-dimensional input samples, then Y^j=\ n a[j] < dN, although for many classes of 
data sets with sparse input values such as images, where each input (single-colour pixel value) may 
have at most 256 distinct values, we might have J2j=i wjj] dN. As for the memory required to 

sample from p(x), it grows proportionally to the size of x, that is in O n Q [j] j. In regards to 

the evaluation of the likelihood p(V |x, c, 0, u), as previously discussed its resource requirements 
are application-specific, but it will typically have time complexity that grows in O (N) and memory 
requirement that grows in O (dN). For instance, the foregoing resource requirements always hold 
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for i.i.d. observation models such as in nonparametric regression and nonparametric classification 
problems. 

Updating 9: We note from Equation (25) that the conditional distribution of 6 given everything 
else has unormalized density 


p(O\n,p)p{V\x.,c,0,u ), (30) 

which we may choose to represent as 

p(log 9\n, p)p(V\x, c, log 0, u). (31) 

As we have put independent log-normal priors on the coordinates of 6 (see Equation 16), we may 
once again use elliptical slice sampling to sample from log 9 before taking the exponential. The time 
complexity of generating a new sample from p(log 9\n. p) will typically be at most linear in the total 
number of distinct kernel hyper-parameters. Overall, the bottleneck of this update is the evaluation 
of the likelihood p(V |x, c, log 9 , u). In this update, the latter operation requires recomputing the 
factors M J k and L? k of Equations (22) and (23), which requires computing and taking the SVD of 
unrelated 2x2 matrices, computations we may perform in parallel. Once the foregoing factors have 
been computed, we evaluate z, the derivative string GP values at boundary times, parallelizing over 
input dimensions, and running a sequential update within an input dimension using Equations (22) 
and (23). Updating z therefore has time complexity that is, in the worst case where no distributed 
computing is available, O ( dN ), and O (N) when there are up to d computing cores. The foregoing 
time complexity will also be that of this update step, unless the observation likelihood is more 
expensive to evaluate. The memory requirement, as in previous updates, is O (dN). 

Overall resource requirement: To summarize previous remarks, the overall computational 
bottleneck of a within-model iteration is the evaluation of the likelihood p(V |x, c, 9 , u). For i.i.d. 
observation models such as classification and regression problems for instance, the corresponding 
time complexity grows in O(N) when d computing cores are available, or 0(dN) otherwise, and 
the memory requirement grows in O(dN). 

5.2.6 Between-Models Updates 

Our reversible-jump Metropolis-Hastings update proceeds as follows. We choose an input dimen¬ 
sion, say j, uniformly at random. If j has no change-points, that is n[j] = 0, we randomly choose 
between not doing anything, and adding a change-point, each outcome having the same probabil¬ 
ity. If n[j] > 0, we either do nothing, add a change-point, or delete a change-point, each outcome 
having the same probability of occurrence. 

Whenever we choose not to do anything, the acceptance ratio is easily found to be one: 

r j0 = 1. (32) 

Whenever we choose to add a change-point, we sample the position d of the proposal new 
change-point uniformly at random on the domain [a- 7 ', Id] of the j-th input dimension. This proposal 
will almost surely break an existing kernel membership cluster, say the p-th, into two; that is d p < 
d < d p+l where we may have a J = c), and/or Id = d +1 . In the event d coincides with an existing 
change-point, which should happen with probability 0, we do nothing. When adding a change-point, 
we sample a new vector of hyper-parameters 6{ from the log-normal prior of Equation (16), and we 
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propose as hyper-parameters for the tentative new clusters [d p , c£ [ and [cl . d pJt , [ the vectors ^dd-ieft 
and 9 3 ., .. defined as 

add-right 


log ^add-left := cos(a) log 9 3 p - sin(a) log 91 (33) 

and 

log ^add-right := sin («) lo S 0J P + cos («) l°g (34) 

respectively, where a € [0, |] and 9j, is the vector of hyper-parameters currently driving the kernel 
membership defined by the cluster [d p ,d p+1 [. We note that if Oj, is distributed as per the prior 
in Equation (16) then lpft and /^ dd _ n „ hl are i.i.d. distributed as per the foregoing prior. More 
generally, this elliptical transformation determines the extent to which the new proposal kernel 
configurations should deviate from the current configuration 9 p . a is restricted to [0, |] so as to 
give a positive weight the the current vector of hyper-parameters 9 P . When a = 0, the left hand- 
side cluster [cp,ci[ will fully exploit the current kernel configuration, while the right hand-side 
cluster [d, d p+1 [ will use the prior to explore a new set of hyper-parameters. When « = f the 
reverse occurs. To preserve symmetry between the left and right hand-side kernel configurations, 
we choose 

« = ?• (35) 

4 

Whenever we choose to delete a change-point, we choose an existing change-point uniformly at 
random, say cjp. Deleting Cp, would merge the clusters \c p _ j, c p [ and [c p . c p+ , [, where we may have 
a 3 = d i and/or b 3 = d p+1 . We propose as vector of hyper-parameters for the tentative merged 
cluster [c 4 .!, d p+1 [ the vector ^ el _ merged satisfying: 

lQ g ^del-merged = COS («) lo g 1 + sin («) lo g ( 3 6) 

which together with 

lo g °L\-* = ~ sin ( a ) lo g 9 p -1 + cos(a) log 9 J p , (37) 

constitute the inverse of the transformation defined by Equations (33) and (34). 

Whenever a proposal to add or delete a change-point occurs, the factors L J k and Ml that would 
be affected by the change in kernel membership structure are recomputed, and so are the affected 
coordinates of z. 

This scheme satisfies the reversibility and dimension-matching requirements of Green (1995). 
Moreover, the absolute value of the Jacobian of the mapping 

(log 0J P , log 91) -> (log^ dd . left , log 0^^) 


of the move to add a change-point in [dp, d p+1 [ reads 



(38) 
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Similarly, the absolute value of the Jacobian of the mapping corresponding to a move to delete 
change-point Cp, namely 



Applying the standard result Equation (8) of Green (1995), the acceptance ratio of the move to add 
a change-point is found to be 


r i+ 


= min I 1 


p(T> |x, C_|_, 0 + , u) A[j] (JP — a- 7 ) Plog 6^ (^°s ^add-left) PlogOi ^add-right y 


p(P|x,c,0,u) 1 + n[j] 


Plog 9i 


P ) Plog Q-i 


(40) 


where p\ og Qj is the prior over log hyper-parameters in the j-th input dimension (as per the prior 
specification Equation 16), which we recall is i.i.d. centred Gaussian with variance p[j], and c + 
and 6 + denote the proposal new vector of change-points and the corresponding vector of hyper¬ 
parameters. The three coloured terms in the acceptance probability are very intuitive. The green 
term — represents the fit improvement that would occur if the new proposal is accepted. 

In the red term — A[j] (V — a- 7 ) represents the average number of change-points in the j- 
th input dimension as per the HPP prior, while 1 + n[j] corresponds to the proposal new number of 
change-points in the j-th dimension, so that the whole red term acts as a complexity regulariser. Fi¬ 
nally, the blue term ( og ( og p] a y s q lc ro le Q f hyper-parameters regulariser. 

Aog ei JAog ei ) 

Similarly, the acceptance ratio of the move to delete change-point Cp, thereby changing the 
number of change-points in the j-th input dimension from n[j] to n[j] — 1, is found to be 


/ p(V\x, C_, 6 , u) n[j] PlogOJ ( lo § ^del-merged) Plog ( lo g^del-*)\ 

^ ’ p(»|x, C, e, u) AH (H - „i) — (log ^ (logej) j ' 


(41) 

where e_ and 6 denote the proposal new vector of change-points and the corresponding vector of 
hyper-parameters. Once more, each coloured term plays the same intuitive role as its counterpart in 
Equation (40). 

Overall resource requirement: The bottleneck of between-models updates is the evaluation of 
the new likelihoods p{V |x, c+, 0 + , u) or p(V |x, c_, u), whose resource requirements, which 

are the same as those of within-nrodels updates, we already discussed. 

Algorithm 2 summarises the proposed MCMC sampler. 


5.3 Multi-Output Problems 

Although we have restricted ourselves to cases where the likelihood model depends on a single 
real-valued function for brevity and to ease notations, cases where the likelihood depends on vector¬ 
valued functions, or equivalently multiple real-valued functions, present no additional theoretical or 
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Algorithm 2 MCMC sampler for nonparametric Bayesian inference of a real-valued latent function 
under a string GP prior 

Inputs: Likelihood model p(V |f, u), link function (P, training data V, test inputs, type of uncon¬ 
ditional kernel, prior parameters «. /3. p. 

Outputs: Posterior samples of the values of the latent function at training and test inputs f and 
f*, and the corresponding gradients Vf and Vf*. 


Step 0: Set n = 0 and c = 0, and sample 6, A, x, u from their priors. 

repeat 

Step 1 : Perform a within-model update. 

1.1: Update each X[j] by sampling from the Gamma distribution in Equation (26). 

1.2: Update u, the vector of other likelihood parameters, if any, using Metropolis-Hastings 
(MH) with proposal q and acceptance ratio Equation (27) or by sampling directly from the 
posterior when p(u) is conjugate to the likelihood model. 

1.3: Update 6, using elliptical slice sampling (ESS) with target distribution Equation (31), 
and record the newly computed factors { L/. Ml } that relate z to its whitened representation 
x. 

1.4: Update x using ESS with target distribution Equation (29). 

1.5: Update change-point positions c sequentially using MH, drawing a proposal update for 
Cp uniformly at random on }c J p _ x , ^ p+ \ [, and accepting the update with probability ry, (defined 

Equation 28). On accept, update the factors {L J k , Ml }. 

Step 2: Perform a between-models update. 

2.1: Sample a dimension to update, say j, uniformly at random. 

2.2: Consider adding or deleting a change-point 
if n[j] =0 then 

Randomly choose to add a change-point with probability 1/2. 
if we should consider adding a change-point then 

Construct proposals to update following Section 5.2.6. 

Accept proposals with probability r 3 + (see Equation 40). 

end if 
else 

Randomly choose to add/delete a change-point with probability 1/3. 
if we should consider adding a change-point then 

Construct proposals to update following Section 5.2.6. 

Accept proposals with probability r J + (see Equation 40). 
else if we should consider deleting a change-point then 
Construct proposals to update following Section 5.2.6. 

Accept proposals with probability r 3 _ (see Equation 41). 
else 

Continue. 

end if 
end if 

Step 3: Compute f, f*, Vf and Vf*, first recovering z from x, and then recalling that f(x) = 



until enough samples are generated after mixing. 
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practical challenge. We may simply put independent string GP priors on each of the latent functions. 
An MCMC sampler almost identical to the one introduced herein may be used to sample from the 
posterior. All that is required to adapt the proposed MCMC sampler to multi-outputs problems is to 
redefine z to include all univariate derivative string GP values across input dimensions and across 
latent functions, perform step 1.1 of Algorithm 2 for each of the latent function, and update step 2.1 
so as to sample uniformly at random not only what dimension to update but also what latent function. 
Previous analyses and derived acceptance ratios remain unchanged. The resource requirements 
of the resulting multi-outputs MCMC sampler on a problem with K latent functions, N training 
and test d-dimensional inputs, are the same as those of the MCMC sampler for a single output 
(Algorithm 2) with N training and test dii-dimensional inputs. The time complexity is O(N) 
when dK computing cores are available, O(dKN) when no distributed computing is available, and 
the memory requirement becomes O(dKN). 

5.4 Flashback to Small Scale GP Regressions with String GP Kernels 

In Section 5.1 we discussed maximum marginal likelihood inference in Bayesian nonparametric 
regressions under additively separable string GP priors, or GP priors with string GP covariance 
functions. We proposed learning the positions of boundary times, conditional on their number, 
jointly with kernel hyper-parameters and noise variances by maximizing the marginal likelihood 
using gradient-based techniques. We then suggested learning the number of strings in each input 
dimension by trading off goodness-of-fit with model simplicity using information criteria such as 
AIC and BIC. In this section, we propose a fully Bayesian nonparametric alternative. 

Let us consider the Gaussian process regression model 

Vi = f(xi) + G, /~£'P(0,fc S Gp(-,-))> G ~ AA (0, a 2 ) , (42) 

Xi G [a. 1 , 6 1 ] X • • • x [a d ,b d ], £l, (43) 

where fcsGP is the covariance function of some string GP with boundary times { a \,} and correspond¬ 
ing unconditional kernels {k 3 k } in the j-th input dimension. It is worth stressing that we place a GP 
(not string GP) prior on the latent function /, but the covariance function of the GP is a string GP 
covariance function (as discussed in Section 4.2 and as derived in Appendix I). Of course when the 
string GP covariance function /csgp is separately additive, the two functional priors are the same. 
However, we impose no restriction on the link function of the string GP that &sgp is the covariance 
function of, other than continuous differentiability. To make full Bayesian nonparametric inference, 
we may place on the boundary times {a^} independent homogeneous Poisson process priors, each 
with intensity AT Similarly to the previous section (Equation 16) our full prior specification of the 
string GP kernel reads 

'\ j 

~ d logAA(0,pt) 

y(j,k)jL(i, P )el±6 l p , 

where 6 J k is the vector of hyper-parameters driving the unconditional kernel kj.. The method de¬ 
veloped in the previous section and the resulting MCMC sampling scheme (Algorithm 2) may be 
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reused to sample from the posterior over function values, pending the following two changes. First, 
gradients Vf and Vf* are no longer necessary. Second, we may work with function values (f, f*) di¬ 
rectly (that is in the original as opposed to whitened space). The resulting (Gaussian) distribution of 
function values (f, f*) conditional on all other variables is then analytically derived using standard 
Gaussian identities, like it is done in vanilla Gaussian process regression, so that the within-model 
update of (f, f*) is performed using a single draw from a multivariate Gaussian. 

This approach to model complexity learning is advantageous over the information criteria al¬ 
ternative of Section 5.1 in that it scales better with large input-dimensions. Indeed, rather than 
performing complete maximum marginal likelihood inference a number of times that grows expo¬ 
nentially with the input dimension, the approach of this section alternates between exploring a new 
combination of numbers of kernel configurations in each input dimension, and exploring function 
values and kernel hyper-parameters (given their number). That being said, this approach should 
only be considered as an alternative to commonly used kernels for small scale regression problems 
to enable the learning of local patterns. Crucially, it scales as poorly as the standard GP paradigm, 
and Algorithm 2 should be preferred for large scale problems. 


6. Experiments 

We now move on to presenting empirical evidence for the efficacy of string GPs in coping with 
local patterns in data sets, and in doing so in a scalable manner. Firstly we consider maximum 
marginal likelihood inference on two small scale problems exhibiting local patterns. We begin 
with a toy experiment that illustrates the limitations of the standard GP paradigm in extrapolating 
and interpolating simple local periodic patterns. Then, we move on to comparing the accuracy 
of Bayesian nonparametric regression under a string GP prior to that of the standard Gaussian 
process regression model and existing mixture-of-experts alternatives on the motorcycle data set of 
Silverman (1985), commonly used for the local patterns and heteroskedasticity it exhibits. Finally, 
we illustrate the performance of the previously derived MCMC sampler on two large scale Bayesian 
inference problems, namely the prediction of U.S. commercial airline arrival delays of Hensman 
et al. (2013) and a new large scale dynamic asset allocation problem. 


6.1 Extrapolation and Interpolation of Synthetic Local Patterns 

In our first experiment, we illustrate a limitation of the standard approach consisting of postulating 
a global covariance structure on the domain, namely that this approach might result in unwanted 
global extrapolation of local patterns, and we show that this limitation is addressed by the string GP 
paradigm. To this aim, we use 2 toy regression problems. We consider the following functions: 

* _ f sin(607rf) t € [0,0.5] , , . _ J sin(167rt) t € [0,0.5] 

° \ ^sin(167rf) t e]0.5,1] ’ 1 \ ^ sin(327rt) t e]0.5,1] ’ 

/o (resp. /i) undergoes a sharp (resp. mild) change in frequency and amplitude at t = 0.5. We con¬ 
sider using their restrictions to [0.25, 0.75] for training. We sample those restrictions with frequency 
300, and we would like to extrapolate the functions to the rest of their domains using Bayesian non¬ 
parametric regression. 

We compare marginal likelihood string GP regression models, as described in Section 5.1, to 
vanilla GP regression models using popular and expressive kernels. All string GP models have two 
strings and the partition is learned in the marginal likelihood maximisation. Figure 5 illustrates 
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plots of the posterior means for each kernel used, and Table 2 compares predictive errors. Overall, it 
can be noted that the string GP kernel with the periodic kernel (MacKay (1998)) as building block 
outperforms competing kernels, including the expressive spectral mixture kernel 

K 

ksM(r) = ^ cr| exp(-2vr 2 r 2 7f) cos(2vrr^fc) 
k =l 


of Wilson and Adams (2013) with K = 5 mixture components . 12 

The comparison between the spectral mixture kernel and the string spectral mixture kernel is 
of particular interest, since spectral mixture kernels are pointwise dense in the family of stationary 
kernels, and thus can be regarded as flexible enough for learning stationary kernels from the data. 
In our experiment, the string spectral mixture kernel with a single mixture component per string sig¬ 
nificantly outperforms the spectral mixture kernel with 5 mixture components. This intuitively can 
be attributed to the fact that, regardless of the number of mixture components in the spectral mixture 
kernel, the learned kernel must account for both types of patterns present in each training data set. 
Hence, each local extrapolation on each side of 0.5 will attempt to make use of both amplitudes and 
both frequencies evidenced in the corresponding training data set, and will struggle to recover the 
true local sine function. We would expect that the performance of the spectral mixture kernel in this 
experiment will not improve drastically as the number of mixture components increases. However, 
under a string GP prior, the left and right hand side strings are independent conditional on the (un¬ 
known) boundary conditions. Therefore, when the string GP domain partition occurs at time 0.5, 
the training data set on [0.25, 0.5] influences the hyper-parameters of the string to the right of 0.5 
only to the extent that both strings should agree on the value of the latent function and its derivative 
at 0.5. To see why this is a weaker condition, we consider the family of pair of functions: 

(awi sin(w 2 t), otU 2 sin(cjif)), im = 2ttki, fcj GN, a E M. 

Such functions always have the same value and derivative at 0.5, regardless of their frequencies, 
and they are plausible GP paths under a spectral mixture kernel with one single mixture component 
(pk = ki and 7 ^ <C 1), and under a periodic kernel. As such it is not surprising that extrapolation 
under a string spectral mixture kernel or a string periodic kernel should perform well. 

To further illustrate that string GPs are able to learn local patterns that GPs with commonly 
used and expressive kernels can’t, we consider interpolating two bivariate functions f 2 and fe that 
exhibit local patterns. The functions are defined as: 

Vu, v € [0.0,1.0] f 2 (u,v) = f 0 (u)fi(v), / 3 (u,u) = \Jfo{u) 2 + fi{v) 2 . (46) 

We consider recovering the original functions as the posterior mean of a GP regression model trained 
on [0.0,0.4] U [0.6,1.0] x [0.0, 0.4] U [0.6,1.0]. Each bivariate kernel used is a product of two 
univariate kernels in the same family, and we used standard Kronecker techniques to speed-up 
inference (see Saatchi, 2011, p.134). The univariate kernels we consider are the same as previously. 
Each univariate string GP kernel has one change-point (two strings) whose position is learned by 
maximum marginal likelihood. Results are illustrated in Figures 6 and 7. Once again it can be 
seen that unlike any competing kernel, the product of string periodic kernels recover both functions 
almost perfectly. In particular, it is impressive to see that, despite fs not being a separable function, 

12. The sparse spectrum kernel of Lazaro-Gredilla et al. (2010) can be thought of as the special case 7 *. -C 1. 
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a product of string periodic kernels recovered it almost perfectly. The interpolations performed 
by the spectral mixture kernel (see Figures 6 and 7) provide further evidence for our previously 
developed narrative: the spectral mixture kernel tries to blend all local patterns found in the training 
data during the interpolation. The periodic kernel learns a single global frequency characteristic 
of the whole data set, ignoring local patterns, while the squared exponential, Matern and rational 
quadratic kernels merely attempt to perform interpolation by smoothing. 

Although we used synthetic data to ease illustrating our argument, it is reasonable to expect that 
in real-life problems the bigger the data set, the more likely there might be local patterns that should 
not be interpreted as noise and yet arc not indicative of the data set as whole. 
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Figure 5: Extrapolation of two functions /o and fi through Bayesian nonparametric regression un¬ 
der string GP priors and vanilla GP priors with popular and expressive kernels. Each 
model is trained on [0.25,0.5] and extrapolates to [0,1.0]. 
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6.2 Small Scale Heteroskedastic Regression 

In our second experiment, we consider illustrating the advantage of the string GP paradigm over 
the standard GP paradigm, but also over the alternatives of Kim et al. (2005), Gramacy and Lee 
(2008), Tresp (2000) and Deisenroth and Ng (2015) that consist of considering independent GP ex¬ 
perts on disjoint parts of the domain or handling disjoint subsets of the data. Using the motorcycle 
data set of Silverman (1985), commonly used for the local patterns and heteroskedasticity it ex¬ 
hibits, we show that our approach outperforms the aforementioned competing alternatives, thereby 
providing empirical evidence that the collaboration between consecutive GP experts introduced in 
the string GP paradigm vastly improves predictive accuracy and certainty in regression problems 
with local patterns. We also illustrate learning of the derivative of the latent function, solely from 
noisy measurements of the latent function. 

The observations consist of accelerometer readings taken through time in an experiment on the 
efficacy of crash helmets. It can be seen at a glance in Figure 8 that the data set exhibits roughly 
4 regimes. Firstly, between 0ms and 15ms the acceleration was negligible. Secondly, the impact 
slowed down the helmet, resulting in a sharp deceleration between 15ms and 28ms. Thirdly, the 
helmet seems to have bounced back between 28ms and 32ms, before it finally gradually slowed 
down and came to a stop between 32ms and 60ms. It can also be noted that the measurement noise 
seems to have been higher in the second half of the experiment. 

We ran 50 independent random experiments, leaving out 5 points selected uniformly at random 
from the data set for prediction, the rest being used for training. The models we considered include 
the vanilla GP regression model, the string GP regression model with marginal maximum likelihood 
inference as described in Section 5.1, mixtures of independent GP experts acting on disjoint subsets 
of the data both for training and testing, the Bayesian committee machine (Tresp (2000)), and the 
robust Bayesian committee machine (Deisenroth and Ng (2015)). We considered string GPs with 4 
and 6 strings whose boundary times are learned as part of the maximum likelihood inference. For 
consistency, we used the resulting partitions of the domain to define the independent experts in the 
competing alternatives we considered. The Matern 3/2 kernel was used throughout. The results 
are reported in Table 3. To gauge the ability of each model to capture the physics of the helmets 
crash experiment, we have also trained all models with all data points. The results are illustrated in 
Figures 8 and 9. 

It can be seen at a glance from Figure 9 that mixtures of independent GP experts are inappro¬ 
priate for this experiment as i) the resulting posterior means exhibit discontinuities (for instance at 
t = 30ms and t = 40ms) that are inconsistent with the physics of the underlying phenomenon, and 
ii) they overfit the data towards the end. The foregoing discontinuities do not come as a surprise as 
each GP regression expert acts on a specific subset of the domain that is disjoint from the ones used 
by the other experts, both for training and prediction. Thus, there is no guarantee of consistency 
between expert predictions at the boundaries of the domain partition. Another perspective to this 
observation is found in noting that postulating independent GP experts, each acting on an element of 
a partition of the domain, is equivalent to putting as prior on the whole function a stochastic process 
that is discontinuous at the boundaries of the partition. Thus, the posterior stochastic process should 
not be expected to be continuous at the boundaries of the domain either. 

This discontinuity issue is addressed by the Bayesian committee machine (BCM) and the robust 
Bayesian committee machine (rBCM) because, despite each independent expert being trained on a 
disjoint subset of the data, each expert is tasked with making predictions about all test inputs, not 
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Figure 6: Extrapolation of a synthetic function /2 (top left corner), cropped in the middle for train¬ 
ing (top right corner), using string GP regression and vanilla GP regression with various 
popular and expressive kernels. 
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Figure 7: Extrapolation of a synthetic function fe (top left corner), cropped in the middle for train¬ 
ing (top right corner), using string GP regression and vanilla GP regression with various 
popular and expressive kernels. 
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just the ones that fall into its input subspace. Each GP expert prediction is therefore continuous on 
the whole input domain, 13 and the linear weighting schemes operated by the BCM and the rBCM on 
expert predictions to construct the overall predictive mean preserve continuity. However, we found 
that the BCM and the rBCM suffer from three pitfalls. First, we found them to be less accurate 
than any other alternative out-of-sample on this data set (see Table 3). Second, their predictions 
of latent function values are overly uncertain. This might be due to the fact that, each GP expert 
being trained only with training samples that lie on its input subspace, its predictions about test 
inputs that lie farther away from its input subspace will typically be much more uncertain, so that, 
despite the weighting scheme of the Bayesian committee machine putting more mass on ‘confident’ 
experts, overall the posterior variance over latent function values might still be much higher than in 
the standard GP paradigm for instance. This is well illustrated by both the last column of Table 3 
and the BCM and rBCM plots in Figure 9. On the contrary, no string GP model suffers from this 
excess uncertainty problem. Third, the posterior means of the BCM, the rBCM and the vanilla GP 
regression exhibit oscillations towards the end (t > 40ms) that are inconsistent with the experimen¬ 
tal setup; the increases in acceleration as the helmet slows down suggested by these posterior means 
would require an additional source of energy after the bounce. 

In addition to being more accurate and more certain about predictions than vanilla GP regres¬ 
sion, the BCM and the rBCM (see Table 3), string GP regressions yield posterior mean acceleration 
profiles that are more consistent with the physics of the experiment: steady speed prior to the shock, 
followed by a deceleration resulting from the shock, a brief acceleration resulting from the change in 
direction after the bounce, and finally a smooth slow down due to the dissipation of kinetic energy. 
Moreover, unlike the vanilla GP regression, the rBCM and the BCM, string GP regressions yield 
smaller posterior variances towards the beginning and the end of the experiment than in the middle, 
which is consistent with the fact that the operator would be less uncertain about the acceleration at 
the beginning and at the end of the experiment—one would indeed expect the acceleration to be null 
at the beginning and at the end of the experiment. This desirable property can be attributed to the 
heteroskedasticity of the noise structure in the string GP regression model. 

We also learned the derivative of the latent acceleration with respect to time, purely from noisy 
acceleration measurements using the joint law of a string GP and its derivative (Theorem 2). This 
is illustrated in Figure 8. 


13. So long as the functional prior is continuous, which is the case here. 
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String GP (6 strings) 



String GP (6 strings) 



Time (ms) 


Figure 8: Posterior mean ± 2 predictive standard deviations on the motorcycle data set (see Silver- 
man, 1985), under a Matern 3/2 derivative string GP prior with 6 learned strings. The top 
figure shows the noisy accelerations measurements and the learned latent function. The 
bottom function illustrates the derivative of the acceleration with respect to time learned 
from noisy acceleration samples. Posterior credible bands are over the latent functions 
rather than noisy measurements, and as such they do not include the measurement noise. 
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6.3 Large Scale Regression 

To illustrate how our approach fares against competing alternatives on a standard large scale prob¬ 
lem, we consider predicting arrival delays of commercial flights in the USA in 2008 as studied by 
Hensman et al. (2013). We choose the same covariates as in Hensman et al. (2013), namely the 
age of the aircraft (number of years since deployment), distance that needs to be covered, airtime, 
departure time, arrival time, day of the week, day of the month and month. Unlike Hensman et al. 
(2013) who only considered commercial flights between January 2008 and April 2008, we consider 
commercial throughout the whole year, for a total of 5.93 million records. In addition to the whole 
data set, we also consider subsets so as to empirically illustrate the sensitivity of computational time 
to the number of samples. Selected subsets consist of 10,000, 100,000 and 1,000,000 records 
selected uniformly at random. For each data set, we use 2/3 of the records selected uniformly at 
random for training and we use the remaining 1/3 for testing. In order to level the playing field 
between stationary and nonstationary approaches, we normalize training and testing data sets. 14 As 
competing alternatives to string GPs we consider the SVIGP of Hensman et al. (2013), the Bayesian 
committee machines (BCM) of Tresp (2000), and the robust Bayesian committee machines (rBCM) 
of Deisenroth and Ng (2015). 

As previously discussed the prediction scheme operated by the BCM is Kolmogorov-inconsistent 
in that the resulting predictive distributions are not consistent by marginalization. 15 Moreover, 
jointly predicting all function values by using the set of all test inputs as query set, as originally 
suggested in Tresp (2000), would be impractical in this experiment given that the BCM requires 
inverting a covariance matrix of the size of the query set which, considering the numbers of test in¬ 
puts in this experiment (which we recall can be as high as 1.97 million), would be computationally 
intractable. To circumvent this problem we use the BCM algorithm to query one test input at a time. 
This approach is in-line with that adopted by Deisenroth and Ng (2015), where the authors did not 
address determining joint predictive distributions over multiple latent function values. For the BCM 
and rBCM, the number of experts is chosen so that each expert processes 200 training points. For 
SVIGP we use the implementation made available by the The GPy authors (2012-2016), and we use 
the same configuration as in Hensman et al. (2013). As for string GPs, we use the symmetric sum 
as link function, and we run two types of experiments, one allowing for inference of change-points 
(String GP), and the other enforcing a single kernel configuration per input dimension (String GP*). 
The parameters a and /3 are chosen so that the prior mean number of change-points in each input 
dimension is 5% of the number of distinct training and testing values in that input dimension, and 
so that the prior variance of the foregoing number of change-points is 50 times the prior mean—the 
aim is to be uninformative about the number of change-points. We run 10, 000 iterations of our 
RJ-MCMC sampler and discarded the first 5,000 as ‘burn-in’. After burn-in we record the states 
of the Markov chains for analysis using a 1-in-100 thinning rate. Predictive accuracies are reported 
in Table 4 and CPU time requirements 16 are illustrated in Figure 10. We stress that all experiments 
were run on a multi-core machine, and that we prefer using the cumulative CPU clock resource 


14. More precisely, we substract from every feature sample (both in-sample and out-of-sample) the in-sample mean of 
the feature and we divide the result by the in-sample standard deviation of the feature. 

15. For instance the predictive distribution of the value of the latent function at a test input an, namely f(x i), obtained 
by using {an} as set of test inputs in the BCM, differs from the predictive distribution obtained by using {an, an} as 
set of test inputs in the BCM and then marginalising with respect to the second input an. 

16. We define CPU time as the cumulative CPU clock resource usage of the whole experiment (training and testing), 
across child processes and threads, and across CPU cores. 
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as time complexity metric, instead of wall-clock time, so as to be agnostic to the number of CPU 
cores used in the experiments. This metric has the merit of illustrating how the number of CPU 
cores required grows as a function of the number of training samples for a fixed/desired wall-clock 
execution time, but also how the wall-clock execution time grows as a function of the number of 
training samples for a given number of available CPU cores. 

The BCM and the rBCM perform the worst in this experiment both in terms of predictive ac¬ 
curacy (Table 4) and total CPU time (Figure 10). The poor scalability of the BCM and the rBCM 
is primarily due to the testing phase. Indeed, if we denote M the total number of experts, then 
M = \ 3 y Q ], as each expert processes 200 training points, of which there are | N. In the prediction 
phase, each expert is required to make predictions about all |_ZV test inputs, which requires evalu¬ 
ating M products of an | N x 200 matrix with a 200 x 200 matrix, which results in a total CPU 
time requirement that grows in 0(M^N200 2 ), which is the same as 0(N 2 ). Given that training 
CPU time grows linearly in N the cumulative training and testing CPU time grows quadratically 
in N. This is well illustrated in Figure 10, where it can be seen that the slopes of total CPU time 
profiles of the BCM and the rBCM in log-log scale are approximately 2. The airline delays data 
set was also considered by Deisenroth and Ng (2015), but the authors restricted themselves to a 
fixed size of the test set of 100, 000 points. However, this limitation might be restrictive as in many 
‘smoothing’ applications, the test data set can be as large as the training data set—neither the BCM 
nor the rBCM would be sufficiently scalable in such applications. 

As for SVIGP, although it was slightly more accurate than string GPs on this data set, it can be 
noted from Figure 10 that string GPs required 10 times less CPU resources. In fact we were unable 
to run the experiment on the full data set with SVIGP—we gave up after 500 CPU hours, or more 
than a couple of weeks wall-clock time given that the GPy implementation of SVIGP makes little 
use of multiple cores. As a comparison, the full experiment took 91.0 hours total CPU time (« 15 
hours wall-clock time on our 8 cores machine) when change-points were inferred and 83.11 hours 
total CPU time (« 14 hours wall-clock time on our 8 cores machine) when change-points were not 
inferred. Another advantage of additively separable string GPs over GPs, and subsequently over 
SVIGP, is that they are more interpretable. Indeed, one can determine at a glance from the learned 
posterior mean string GPs of Figure 11 the effect of each of the 8 covariates considered on arrival 
delays. It turns out that the three most informative factors in predicting arrival delays are departure 
time, distance and arrival time, while the age of the aircraft, the day of the week and the day of 
the month seem to have little to no effect. Finally, posterior distributions of the number of change- 
points are illustrated in Figure 12, and posterior distributions of the locations of change-points are 
illustrated in Figure 13. 
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N 

String GP 

String GP* 

BCM 

rBCM 

SVIGP 

10,000 

100,000 

1,000,000 

5,929,413 

1.03 ±0.10 

0.93 ±0.03 

0.93 ±0.01 

0.90 ±0.01 

1.06 ±0.10 

0.96 ±0.03 

0.92 ±0.01 

0.93 ±0.01 

1.06 ±0.10 

1.66 ±0.03 

N/A 

N/A 

1.06 ±0.10 

1.04 ±0.04 
N/A 

N/A 

0.90 ±0.09 
0.88 ±0.03 
0.82 ± 0.01 

N/A 


Table 4: Predictive mean squared errors (MSEs) ± one standard error on the airline arrival delays 
experiment. Squared errors are expressed as fraction of the sample variance of airline 
arrival delays, and hence are unitless. With this normalisation, a MSE of 1.00 is as good 
as using the training mean arrival delays as predictor. The * in String GP* indicates that 
inference was performed without allowing for change-points. N/A entries correspond to 
experiments that were not over after 500 CPU hours. 



Figure 10: Total CPU time (training and testing) taken by various regression approaches on the 
airline delays data set as a function of the size of the subset considered, in log-log scale. 
The experimental setup is described in Section 6.3. The CPU time reflects actual CPU 
clock resource usage in each experiment, and is therefore agnostic to the number of CPU 
cores used. It can be regarded as the wall-clock time the experiment would have taken 
to complete on a single-core computer (with the same CPU frequency). Dashed lines 
are extrapolated values, and correspond to experiments that did not complete after 500 
hours of CPU time. 
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Figure 11: Posterior mean ± one posterior standard deviation of univariate string GPs in the airline 
delays experiment of Section 6.3. Change-points were automatically inferred in this 
experiment. ^ 
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Figure 12: Posterior distributions of the numbers of change-points in each input dimension in the 
airline delays experiment of Section 6.3. 
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Figure 13: Posterior distributions of the locations of change-points in each input dimension in the 
airline delays experiment of Section 6.3. Dimensions that were learned to exhibit no 
change-point have been omitted here. 
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6.4 Large Scale Dynamic Asset Allocation 

An important feature of our proposed RJ-MCMC sampler (Algorithm 2) is that, unlike the BCM, 
the rBCM and SVIGP, which are restricted to Bayesian nonparametric regression and classification, 
Algorithm 2 is agnostic with regard to the likelihood model, so long as it takes the form p(V |f, u). 
Thus, it may be used as is on a wide variety of problems that go beyond classification and regression. 
In this experiment we aim to illustrate the efficacy of our approach on one such large scale problem 
in quantitative finance. 


6.4.1 Background 

Let ( Xi(t)) t>0 for i = 1,..., n be n stock price processes. Let ( Xi(t)) t>0 for i = 1,..., n denote 
the market capitalisation processes, that is 2Q(t) = rii(t)xi(t ) where rijit) is the number of shares 
in company i trading in the market at time t. We call long-only portfolio any vector-valued stochastic 
process 7r = (7Ti,..., 7r n ) taking value on the unit simplex on M n , that is 

n 

Vi, t, 7Tj(f) > 0 and 

i =1 


Each process 7u represents the proportion of an investor’s wealth invested in (holding) shares in 
asset i. An example long-only portfolio is the market portfolio // = (// 1 ,..., fi n ) where 




Xj(t) 

X\ (t) + • • • + X n (t) 


(47) 


is the market weight of company i at time t, that is its size relative to the total market size (or that 
of the universe of stocks considered). The market portfolio is very important to practitioners as it 
is often perceived not be to subject to idiosyncracies, but only to systemic risk. It is often used as 
an indicator of how the stock market (or a specific universe of stocks) performs as a whole. We 
denote Z n the value process of a portfolio it with initial capital Z n ( 0). That is, Z n (t) is the wealth 
at time t of an investor who had an initial wealth of Z n ( 0), and dynamically re-allocated all his 
wealth between the n stocks in our universe up to time t following the continuous-time strategy it. 

A mathematical theory has recently emerged, namely stochastic portfolio theory (SPT) (see 
Karatzas and Fernholz, 2009) that studies the stochastic properties of the wealth processes of cer¬ 
tain portfolios called functionally-generated portfolio under realistic assumptions on the market 
capitalisation processes (X l (t)) f >u . Functionally-generated portfolios are rather specific in that 
the allocation at time t, namely (tti (t), ... , 7r n (i)), solely depends on the market weights vector 
(pi(i ),..., p n (t)). Nonetheless, some functionally-generated portfolios 7r* have been found that, 
under the mild (so-called diversity) condition 


3(tmax;0 <C fimax 1 S.t. Vi ft , t T, P ’ iif ) fimax; (48) 

outperform the market portfolio over the time horizon [0, T] with probability one (see Vervuurt and 
Karatzas, 2015; Karatzas and Fernholz, 2009). More precisely, 


P (Z n *(T) > Zfj,(T)) = 1 and P (Z n * (T) > Z^{T)) > 0. (49) 


Galvanized by this result, we here consider the inverse problem consisting of learning from his¬ 
torical market data a portfolio whose wealth process has desirable user-specified properties. This 


54 



String and Membrane Gaussian Processes 


inverse problem is perhaps more akin to the problems faced by investment professionals: i) their 
benchmarks depend on the investment vehicles pitched to investors and may vary from one vehicle 
to another, ii) they have to take into account liquidity costs, and iii) they often find it more valu¬ 
able to go beyond market weights and leverage multiple company characteristics in their investment 
strategies. 

6.4.2 Model Construction 

We consider portfolios ir? = tt„ j of the form 



__ 

/(ct(f))H-h/(c n (t))’ 


(50) 


where Cj(f) <E W l are some quantifiable characteristics of asset i that may be observed in the market 
at time t, and / is a positive-valued function. Portfolios of this form include all functionally- 
generated portfolios studied in SPT as a special case. 17 A crucial departure of our approach from 
the aforementioned type of portfolios is that the market characteristics processes c h need not be 
restricted to size-based information, and may contain additional information such as social me¬ 
dia sentiments, stock price path-properties, but also characteristics relative to other stocks such 
as performance relative to the best/worst performing stock last week/month/year etc. We place a 
mean-zero string GP prior on log /. Given some historical data V corresponding to a training time 
horizon [0, T], the likelihood model p (X>|7r^) is defined by the investment professional and reflects 
the extent to which applying the investment strategy nf over the training time horizon would have 
achieved a specific investment objective. An example investment objective is to achieve a high 
excess return relative to a benchmark portfolio a 

U ER (t r /) = logger) - log Z a (T). (51) 

a can be the market portfolio (as in SPT) or any stock index. Other risk-adjusted investment objec¬ 
tives may also be used. One such objective is to achieve a high Sharpe-ratio, defined as 


(fsR 



rV 252 


(52) 


where the time t is in days, r(t) := log Z n f it) — log Z^f it — 1) are the daily returns the portfolio 
ii-i and r = p Ylt= l r W its average daily return. More generally, denoting U (vr-^) the performance 
of the portfolio rJ over the training horizon [0, T] (as per the user-defined investment objective), 
we may choose as likelihood model a distribution over U (tt-T) that reflects what the investment 
professional considers good and bad performance. For instance, in the case of the excess return 
relative to a benchmark portfolio or the Sharpe ratio, we may choose U (it*') to be supported on 
]0, +oo[ (for instance U (W) can be chosen to be Gamma distributed) so as to express that portfolios 
that do not outperform the benchmark or loose money overall in the training data are not of interest. 
We may then choose the mean and standard deviation of the Gamma distribution based on our 

17. We refer the reader to Karatzas and Femholz (2009) for the definition of functionally-generated portfolios. 
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expectation as to what performance a good candidate portfolio can achieve, and how confident we 
feel about this expectation. Overall we have, 

p (v\n f ^ = 7 (u (n f ^ ; a e , f3 e ^) , (53) 

where 7 (.; a, (5) is the probability density function of the Gamma distribution. Noting, from Equa¬ 
tion (50) that Trf(t) only depends on / through its values at (ci(t),..., c n (t)), and assuming that 
U ( 7 N) only depends on it? evaluated at a finite number of times (as it is the case for excess returns 
and the Sharpe ratio), it follows that only depends on f, a vector of values of / at a finite 

number of points. Hence the likelihood model, which we may rewrite as 

p(P\f) = 7 (lA (V/) ; a e , (3 e ) , (54) 

is of the form required by the RJ-MCMC sampler previously developed. By sampling from the 
posterior distribution p( f, f*. Vf, Vf*|D), the hope is to learn a portfolio that did well during the 
training horizon, to analyse the sensitivity of its investment strategy to the underlying market char¬ 
acteristics through the gradient of /, and to evaluate the learned investment policy on future market 
conditions. 


6.4.3 Experimental Setup 


The universe of stocks we considered for this experiment are the constituents of the S&P 500 in¬ 
dex, accounting for changes in constituents over time and corporate events. We used the period 1 st 
January 1990 to 31 st December 2004 for training and we tested the learned portfolio during the pe¬ 
riod 1 st January 2005 to 31 st December 2014. We rebalanced the portfolio daily, for a total of 2.52 
million input points at which the latent function / must be learned. We considered as market char¬ 
acteristics the market weight (CAP), the latest return on asset (ROA) defined as the ratio between 
the net yearly income and the total assets as per the latest balance sheet of the company known at 
the time of investment, the previous close-to-close return (PR), the close-to-close return before the 
previous (PR2), and the S&P long and short term credit rating (LCR and SCR). While the market 
weight is a company size characteristic, the ROA reflects how well a company performs relative to 
its size, and we hope that S&P credit ratings will help further discriminate successful companies 
from others. The close-to-close returns are used to learn possible ‘momentum’ patterns from the 
data. The data originate from the CRSP and Compustat databases. In the experiments we consid¬ 
ered as performance metric the annualised excess return Z^er-ewp relative to the equally-weighted 
portfolio. We found the equally-weighted portfolio to be a harder benchmark to outperform than the 
market portfolio. We chose a e and B e in Equation (54) so that the mean of the Gamma distribution 
is 10.0 and its variance 0.5, which expresses a very greedy investment target. 

It is worth pointing out that none of the scalable GP alternatives previously considered can cope 
with our likelihood model Equation (54). We compared the performance of the learned string GP 
portfolio out-of-sample to those of the best three SPT portfolios studied in Vervuurt and Karatzas 
(2015), namely the equally weighted portfolio 


and the diversity weighted portfolios 


7T 


EWP 


C t) 


1 

5 

n 


7r; 


DWP 


(Pp) = 




/u (t)p + ■ ■ ■ + n n (t)p’ 


(55) 


(56) 
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— String GP 

— DWP (p=0.5) 

— DWP (p=-0.5) 

— EWP 

— MKT 


Time 


Figure 14: Evolution of the wealth processes of various long-only trading strategies on the S&P 
500 universe of stocks between 1 st January 2005 (where we assume a starting wealth 
of 1) and 31 st December 2014. The String GP strategy was learned using market data 
from 1 st January 1990 to 31 st December 2004 as descried in Section 6.4. EWP refers 
to the equally-weighted portfolio, MKT refers to the market portfolio (which weights 
stocks proportionally to their market capitalisations) and DWP (p) refers to the diversity- 
weighted portfolio with exponent p (which weights stocks proportionally to the p-th 
power of their market weights). 


with parameter p equals to —0.5 and 0.5, and the market portfolio. Results are provided Table 5, and 
Figure 14 displays the evolution of the wealth process of each strategy. It can be seen that the learned 
string GP strategy considerably outperforms the next best SPT portfolio. This experiment not only 
demonstrates (once more) that string GPs scale to large scale problems, it also illustrates that our 
inference scheme is able to unlock commercial value in new intricate large scale applications where 
no alternative is readily available. In effect, this application was first introduced by Kom Samo and 
Vervuurt (2016), where the authors used a Gaussian process prior on a Cartesian grid and under a 
separable covariance function so as to speed up inference with Kronecker techniques. Although the 
resulting inference scheme has time complexity that is linear in the total number of points on the 
grid, for a given grid resolution, the time complexity grows exponentially in the dimension of the 
input space (that is the number of trading characteristics), which is impractical for d > 4. On the 
other hand, string GPs allow for a time complexity that grows linearly with the number of trading 
characteristics, thereby enabling the learning of subtler market inefficiencies from the data. 
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Strategy 

Shaipe Ratio 

Z v (T)/Zewp(T) 

Avg. Ann. Ret. 

String GP 

0.73 

2.87 

22.07% 

DWP (p = -0.5) 

0.55 

1.07 

10.56% 

EWP 

0.53 

1.00 

9.84% 

MKT 

0.34 

0.62 

4.77% 

DWP (p = 0.5) 

0.33 

0.61 

4.51% 


Table 5: Performance of various long-only trading strategies on the S&P 500 universe of stocks 
between 1 st January 2005 (where we assume a starting wealth of 1) and 31 st December 
2014. The String GP strategy was learned using market data from 1 st January 1990 to 31 st 
December 2004 as descried in Section 6.4. EWP refers to the equally-weighted portfolio, 
MKT refers to the market portfolio (which weights stocks proportionally to their market 
capitalisations) and DWP (p) refers to the diversity-weighted portfolio with exponent p 
(which weights stocks proportionally to the p -th power of the market weight of the asset). 
Z n (T ) denotes the terminal wealth of strategy ir, and Avg. Ann. Ret. is the strategy’s 
equivalent constant annual return over the test horizon. 


7. Discussion 

In this paper, we introduce a novel class of smooth functional priors (or stochastic processes), which 
we refer to as string GPs, with the aim of simultaneously addressing the lack of scalability and the 
lack of flexibility of Bayesian kernel methods. Unlike existing approaches, such as Gaussian process 
priors (Rasmussen and Williams (2006)) or student-t process priors (Shah et al. (2014)), which are 
parametrised by global mean and covariance functions, and which postulate fully dependent finite¬ 
dimensional marginals, the alternative construction we propose adopts a local perspective and the 
resulting finite-dimensional marginals exhibit conditional independence structures. Our local ap¬ 
proach to constructing string GPs provides a principled way of postulating that the latent function 
we wish to learn might exhibit locally homogeneous patterns, while the conditional independence 
structures constitute the core ingredient needed for developing scalable inference methods. More¬ 
over, we provide theoretical results relating our approach to Gaussian processes, and we illustrate 
that our approach can often be regarded as a more scalable and/or more flexible extension. We ar¬ 
gue and empirically illustrate that string GPs present an unparalleled opportunity for learning local 
patterns in small scale regression problems using nothing but standard Gaussian process regres¬ 
sion techniques. More importantly, we propose a novel scalable RJ-MCMC inference scheme to 
learn latent functions in a wide variety of machine learning tasks, while simultaneously determin¬ 
ing whether the data set exhibits local patterns, how many types of local patterns the data might 
exhibit, and where do changes in these patterns are likely to occur. The proposed scheme has time 
complexity and memory requirement that are both linear in the sample size N. When the number 
of available computing cores is at least equal to the dimension d of the input space, the time com¬ 
plexity is independent from the dimension of the input space. Else, the time complexity grows in 
O(dN). The memory requirement grows in O(dN). We empirically illustrate that our approach 
scales considerably better than competing alternatives on a standard benchmark data set, and is able 
to process data sizes that competing approaches cannot handle in a reasonable time. 
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7.1 Limitations 

The main limitation of our approach is that, unlike the standard GP paradigm in which the time 
complexity of marginal likelihood evaluation does not depend on the dimension of the input space 
(other than through the evaluation of the Gram matrix), the string GP paradigm requires a number 
of computing cores that increases linearly with the dimension of the input space, or alternatively has 
a time complexity linear in the input space dimension on single-core machines. This is a by-product 
of the fact that in the string GP paradigm , we jointly infer the latent function and its gradient. If the 
gradient of the latent function is inferred in the standard GP paradigm, the resulting complexity will 
also be linear in the input dimension. That being said, overall our RJ-MCMC inference scheme will 
typically scale better per iteration to large input dimensions than gradient-based marginal likelihood 
inference in the standard GP paradigm, as the latter typically requires numerically evaluating an 
Hessian matrix, which requires computing the marginal likelihood a number of times per iterative 
update that grows quadratically with the input dimension. In contrast, a Gibbs cycle in our MCMC 
sampler has worst case time complexity that is linear in the input dimension. 


7.2 Extensions 

Some of the assumptions we have made in the construction of string GPs and membrane GPs can 
be relaxed, which we consider in detail below. 


7.2.1 Stronger Global Regularity 

We could have imposed more (multiple continuous differentiability) or less (continuity) regularity 
as boundary conditions in the construction of string GPs. We chose continuous differentiability 
as it is a relatively mild condition guaranteed by most popular kernels, and yet the corresponding 
treatment can be easily generalised to other regularity requirements. It is also possible to allow for 
discontinuity at a boundary time «/. by replacing p h k and T, b k in Equation (5) with /. M ak and ak]ak 
respectively, or equivalently by preventing any communication between the k- th and the (/,: + l)-th 
strings. This would effectively be equivalent to having two independent string GPs on [ao, «■/,■] and 
]afc, ax]- 


7.2.2 Differential Operators as Link Functions 

Our framework can be further extended to allow differential operators as link functions, thereby 
considering the latent multivariate function to infer as the response of a differential system to inde¬ 
pendent univariate string GP excitations. The RJ-MCMC sampler we propose in Section 5 would 
still work in this framework, with the only exception that, when the differential operator is of first 
order, the latent multivariate function will be continuous but not differentiable, except if global reg¬ 
ularity is upgraded as discussed above. Moreover, Proposition 5 can be generalised to first order 
linear differential operators. 
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7.2.3 Distributed String GPs 

The RJ-MCMC inference scheme we propose may be easily adapted to handle applications where 
the data set is so big that it has to be stored across multiple clusters, and inference techniques have 
to be developed as data flow graphs 18 (for instance using libraries such as TensorFlow). 

To do so, the choice of string boundary times can be adapted so that each string has the same 
number of inner input coordinates, and such that in total there are as many strings across dimensions 
as a target number of available computing cores. We may then place a prior on kernel memberships 
similar to that of previous sections. Here, the change-points may be restricted to coincide with 
boundary times, and we may choose priors such that the sets of change-points are independent 
between input dimensions. In each input dimension the prior on the number of change-points can 
be chosen to be a truncated Poisson distribution (truncated to never exceed the total number of 
boundary times), and conditional on their number we may choose change-points to be uniformly 
distributed in the set of boundary times. In so doing, any two strings whose shared boundary time 
is not a change-point will be driven by the same kernel configuration. 

This new setup presents no additional theoretical or practical challenges, and the RJ-MCMC 
techniques previously developed arc easily adaptable to jointly learn change-points and function 
values. Unlike the case we developed in previous sections where an update of the univariate string 
GP corresponding to an input dimension, say the j-th, requires looping through all distinct j-th 
input coordinates, here no step in the inference scheme requires a full view of the data set in any 
input dimension. Full RJ-MCMC inference can be constructed as a data flow graph. An exam¬ 
ple such graph is constructed as follows. The leaves correspond to computing cores responsible 
for generating change-points and kernel configurations, and mapping strings to kernel configura¬ 
tions. The following layer is made of compute cores that use kernel configurations coming out of 
the previous layer to sequentially compute boundary conditions corresponding to a specific input 
dimension—there are d such compute cores, where d is the input dimension. These compute cores 
then pass computed boundary conditions to subsequent compute cores we refer to as string com¬ 
pute cores. Each string compute core is tasked with computing derivative string GP values for a 
specific input dimension and for a specific string in that input dimension, conditional on previously 
computed boundary conditions. These values are then passed to a fourth layer of compute cores, 
each of which being tasked with computing function and gradient values corresponding to a small 
subset of training inputs from previously computed derivative string GP values. The final layers 
then computes the log-likelihood using a distributed algorithm such as Map-Reduce when possible. 
This proposal data flow graph is illustrated Figure 15. 

We note that the approaches of Kim et al. (2005), Gramacy and Lee (2008), Tresp (2000), 
and Deisenroth and Ng (2015) also allow for fully-distributed inference on regression problems. 
Distributed string GP RJ-MCMC inference improves on these in that it places little restriction on 
the type of likelihood. Moreover, unlike Kim et al. (2005) and Gramacy and Lee (2008) that yield 
discontinuous latent functions, string GPs are continuously differentiable, and unlike Tresp (2000) 
and Deisenroth and Ng (2015), local experts in the string GP paradigm (i.e. strings) are driven by 
possibly different sets of hyper-parameters, which facilitates the learning of local patterns. 


18. A data flow graph is a computational (directed) graph whose nodes represent calculations (possibly taking place on 
different computing units) and directed edges correspond to data flowing between calculations or computing units. 
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7.2.4 Approximate MCMC for I.I.D. Observations Likelihoods 

As discussed in Section 5.2, the bottleneck of our proposed inference scheme is the evaluation 
of likelihood. When the likelihood factorises across training samples, the linear time complexity 
of our proposed approach can be further reduced using a Monte Carlo approximation of the log- 
likelihood (see for instance Bardenet et al. (2014) and references therein). Although the resulting 
Markov chain will typically not converge to the true posterior distribution, in practice its stationary 
distribution can be sufficiently close to the true posterior when reasonable Monte Carlo sample sizes 
are used. Convergence results of such approximations have recently been studied by Bardenet et al. 
(2014) and Alquier et al. (2016). We expect this extension to speed-up inference when the number 
of compute cores is in the order of magnitude of the input dimension, but we would recommend the 
previously mentioned fully-distributed string GP inference extension when compute cores are not 
scarce. 

7.2.5 Variational Inference 

It would be useful to develop suitable variational methods for inference under string GP priors, 
that we hope will scale similarly to our proposed RJ-MCMC sampler but will converge faster. We 
anticipate that the main challenge here will perhaps be the learning of model complexity, that is the 
number of distinct kernel configurations in each input dimension. 
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they are responsible for. These values are then used to compute the log-likelihood in a distributed fashion using the Map-Reduce 
algorithm. Each calculation in the corresponding RJ-MCMC sampler would be initiated at one of the compute cores, and would 
trigger updates of all edges accessible from that compute core. 
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Appendix A. 

We begin by recalling Kolmogorov’s extension theorem, which we will use to prove the existence 
of derivative Gaussian processes and string Gaussian processes. 


Theorem 7 (Kolmogorov’s extension theorem. (0ksendal, 2003, Theorem 2.1.5)) 

Let I be an internal, let all t\,... ,L £ I, i,n £ N*, let be probability measures on 

such that: 

(-^7r(l)> • • • ;-^7r(i)) — v ti,...,ti (-^1) • • • j ^i) (57) 

for all permutations it on {1,... ,i} and 


(7*1 j • • • j Ff) (7*1 j • • • j Fif 


(58) 


for all m £ N* where the set on the right hand side has a total of i + m factors. Then there exists a 
probability space (0, F. P) and an M n valued stochastic process (Xf/af on LI, 


X t : n M n 


such that 

v tl ,...,u(Fi,...,Fi)=F{X tl £ F 1 ,...,X ti £ F0 (59) 

for all ti,... ,ti £ I, i £ N* and for all Borel sets F\,..., Fj. 

It is easy to see that every stochastic process satisfies the permutation and marginalisation condi¬ 
tions (57) and (58). The power of Kolmogorov’s extension theorem is that it states that those two 
conditions are sufficient to guarantee the existence of a stochastic process. 

Appendix B. Proof of Proposition 1 

In this section we prove Proposition 1, which we recall below. 

Proposition 1 (Derivative Gaussian processes) 

Let / be an interval, fc:/x/->laC 2 symmetric positive semi-definite function, 19 rn : / Ala 
C 1 function. 

(A) There exists a M 2 -valued stochastic process (D t ) teI , D t = (zt, z'f), such that for all /; |..... t n £ 

I, 

(z tl ,...,z tn ,z't 11 ...,z' tn ) 

19. C 1 (resp. C 2 ) functions denote functions that are once (resp. twice) continuously differentiable on their domains. 
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is a Gaussian vector with mean 


■ ■ ■ 



and covariance matrix such that 




We herein refer to (D t )tei as a derivative Gaussian process. 

(B) ( zt)tei is a Gaussian process with mean function m, covariance function k and that is C 1 in the 
L 2 (mean square) sense. 

(C) ( z' t )tei is a Gaussian process with mean function ^ and covariance function .^2^. Moreover, 
{z' t )tei is the L 2 derivative of the process ( zt)tei■ 

Proof 

Appendix B.l Proof of Proposition 1 (A) 

Firstly, we need to show that the matrix suggested in the proposition as the covariance matrix of 
{zt A ■..., Zt n ,z' tl ,, z[ n ) is indeed positive semi-definite. To do so, we will show that it is the 
limit of positive definite matrices (which is sufficient to conclude it is positive semi-definite, as 
x T M n x > 0 for a convergent sequence of positive definite matrices implies x T M OQ x > 0). 

Let k be as in the proposition, h such that Vi < n, U + h £ I and (zt)t&i be a Gaussian process 
with covariance function k. The vector 



is a Gaussian vector whose covariance matrix is positive definite and such that 


cov (z ti ,z tj ) = k(ti,tj), 


( 60 ) 



( 61 ) 


and 



cov 


( 62 ) 


As k is C 2 , h —t k(x, y + h) admits a second order Taylor expansion about h = 0 for every x, and 
we have: 
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Similarly, h —> k(x + h,y + h) admits a second order Taylor expansion about h = 0 for every x. y 
and we have: 



(64) 


Hence, 



(65) 


and 



Dividing Equation (65) by h, dividing Equation (66) by h 2 , and taking the limits, we obtain: 



and 



which corresponds to the covariance structure of Proposition 1. In other words the proposed covari¬ 
ance structure is indeed positive semi-definite. 

Let tn be the Gaussian probability measure corresponding to the joint distribution of 
(z tl ,..., Zt n , z ' tl ,... , z[ ) as per the Proposition 1 , and let u/ | ) t be the measure on the Borel 
cr-algebra £>(M 2 x • • • x M 2 ) such that for any 2 n intervals In, /12,..., I n 1, I n 2, 


n times 



The measures ^ arc the finite dimensional measures corresponding to the stochastic object 
(D t )ta sampled at times t±,..., t n . They satisfy the time permutation and marginalisation con¬ 
ditions of Kolmogorov’s extension theorem as the Gaussian measures t do. Hence, the Un¬ 
valued stochastic process {D t )t&i defined in Proposition 1 does exist. 

Appendix B.2 Proof of Proposition 1 (B) 

That ( zt)tei is a Gaussian process results from the fact that the marginals (z tl ,..., z t , n ) are Gaus¬ 
sian vectors with mean (m(t 1 ),..., m(t n )) and covariance matrix [k(U, Th e f act th at 

( zt)tei is C 1 in the L 2 sense is a direct consequence of the twice continuous differentiability of k. 
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Appendix B.3 Proof of Proposition 1 (C) 

In effect, it follows from Proposition 1(A) that ~ t+h h Zt — z[ is a Gaussian random variable with 
mean 

m(t + h) — m(t) dm 
h ~dT * 

and variance 

k(t + h,t + h)~ 2 k(t + h,t) + k(t,t ) - 2 + h,t)h + 2 ^(t,t)h + ^^(t,t)h 2 

h 2 ' 


Taking the second order Taylor expansion of the numerator in the fraction above about h = 0 we 
get o(h 2 ), hence 


We also have 


Therefore, 


lim Var 
h—>o 


Zt+h h Zt -41=0. 


(zt + h-z t A dm , 


lim E 


lim E 

/i-K) 


Z' Zt+h - ^ / 

1“S— 


= 0 , 


which proves that ( z [) is the L 2 derivative of (zj. The fact that ( 4 ) is a Gaussian process with 
mean function and covariance function AAl j s a di rect consequence of the distribution of the 
marginals (z' tl . ..., z' tn ). Moreover, the continuity of (z' t ) in the L 2 sense is a direct consequence of 
the continuity of (see Rasmussen and Williams, 2006, p. 81 4.1.1). ■ 


Appendix C. Proof of Theorem 2 

In this section we prove Theorem 2 which we recall below. 

Theorem 2 (String Gaussian process) 

Let ao < ■ ■ ■ < ak < ■ ■ ■ < ax , I = [ao, clk] and let fi, X) be the multivariate Gaussian den¬ 
sity with mean vector // and covariance matrix X. Furthermore, let ( 7 / 4 . : |, a/,.] -A M) fce r 1 be 

C 1 functions, and (kk ■ [afc-i, a,k\ x [afc_i, cik\ —>• R)fce[i../v] be C 3 symmetric positive semi-definite 
functions, neither degenerate at ak- 1 , nor degenerate at a,/,, given a/,._ ]. 

(A) There exists an Revalued stochastic process (SDt)tei, SD t = ( zt , 4) satisfying the following 
conditions: 

1) The probability density of ( SD ao ,..., SD aK ) reads: 

I< 

Pb{x 0 , • • •, zir) := PAT Xfc) 

k =0 

where: X 0 = iK ao;oo , V fc > 0 X fc = fcK afc;afe — feK afc;afc _ 1 feK afc _ i;afc _ 1 
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= i M ao> V£:>0 n b k = k M 


a k 


+ fcK afc;afc _ 1 k^a k _ 1 -,a k _ 1 


_A x k -1 


with 


K 'I 


h(u,v) 




and fe M u 


m k (u ) 


2 ) Conditional on (ST^ = Zfc)fce[o..ich the restrictions (S , A)te]a fc _ lj a fc [ ) & C [1..7vT] are indepen¬ 
dent conditional derivative Gaussian processes, respectively with unconditional mean function 
rrik and unconditional covariance function k k and that are conditioned to take values x k -i and x k 
at a k _ i and a k respectively. We refer to {SD t )t^i as a string derivative Gaussian process, and to 
its first coordinate ( zt)tei as a string Gaussian process namely, 


{zt)tei ~ SQT({a k }, {m k }, {k k }). 

(B) The string Gaussian process (zt)tei defined in (A) is C 1 in the L 2 sense and its L 2 derivative 
is the process (z' t )tei defined in (A). 


Proof 


Appendix C.l Proof of Theorem 2 (A) 

We will once again turn to Kolmogorov’s extension theorem to prove the existence of the stochastic 
process {SD t )t^i- The core of the proof is in the finite dimensional measures implied by Theorem 
2 (A-l) and (A-2). Let [t^ (=\a k - \ ■ a k{}, r \\ iV( ,] kc'\ n tnncs - We first formally construct the 

finite dimensional measures implied by Theorem 2 (A-l) and (A-2), and then verify that they satisfy 
the conditions of Kolmogorov’s extension theorem. 

Let us define the measure ;r, , K K as the probability measure having density 

with respect to the Lebesgue measure on £>(M 2 x • • • xM 2 ) that reads: 

V -v-' 

1 +n-\-K times 


PSD { x t} j • • • j x t\ T •>••••> x t¥ ? • • • > x t¥ T 5 x ao > • • • ? x a K ) —Pb{ x a 0 ? • • • ? x a K ) x 

1 lvi I ^ K 

K 

pm (***>•■■, 


X+k 


( 68 ) 


k =1 


where pt, is as per Theorem 2 (A-l) and p^- k 1 ’ " k {x t k,,x t k ) is the (Gaussian) pdf of the joint 

distribution of the values at times {tf e]afc_i, a k [\ of the conditional derivative Gaussian process 
with unconditional mean functions rn k and unconditional covariance functions k k that is condi¬ 
tioned to take values x ak l = (z ak _ ] , z' a ) and x ak = (z ak , z ' a ) at times a k _ | and a k respectively 
(the corresponding—conditional—mean and covariance functions are derived from Equations (3 
and 4). Let us extend the family of measures v SD to cases where some or all boundary times a k are 
missing, by integrating out the corresponding variables in Equation (68). For instance when ao and 
a\ are missing, 


•fU. * .* «.«»*••••.. -,TiS K ,A 2 ,...,A K ) 


:= v 


N i 

SD 

>i +i p 




Ni > 


rpK 
i 1 ! > 


rpK TTJ 2 ■ 

j j 


\ M, ■ ■ -,A k) 


(69) 


67 








Kom Samo and Roberts 


where Ai and Tj are rectangle in M 2 . Finally, we extend the family of measures v SD to any arbitrary 
set of indices {t \,..., t n } as follows: 


■ ■ ■ ,T n ) : ^* ( 1 ) ,...,t^« (n) (T ^*( 1 ),... ,Tn*(n)), 


,SD 


(70) 


where 7 r* is a permutation of { 1 ,..., n} such that (fwi), • • •, iwn)} verify the following condi¬ 
tions: 


1. Vi, j, if ti ak t [, tj E]afc 2 _i, ak 2 [, and k\ < & 2 , then Idx(ij) < Idx(f J ). Where 

Idx(ij) stands for the index of ti in {t^*^,..., f-Wn)}; 

2. if ti {ao, • • •, ax} and tj E {do,..., clk} then Idx(T) < Idx(fj); 

3. if ti E {do, ■ • •, clk} and tj E {do,..., clk} then Idx(tj) < Idx(tj) if and only if ti < tj. 

Any such measure will fall in the category of either Equation ( 68 ) or Equation (69). 

Although 7 r* is not unique, any two permutations satisfying the above conditions will only differ 
by a permutation of times belonging to the same string interval ]dfc_i,dfc[. Moreover, it follows 
from Equations ( 68 ) and (69) that the measures j are invariant by permutation of times 

belonging to the same string interval ]dfc_i,dfc[, and as a result any two it* satisfying the above 
conditions will yield the same probability measure. 

The finite dimensional probability measures z/jr D tn are the measures implied by Theorem 2. 
The permutation condition (57) of Kolmogorov’s extension theorem is met by virtue of Equa¬ 
tion (70). In effect for every permutation 7 r of {1,..., n}, if we let tv' : {vr(l),..., 7 r(n)} -» 
{tv* (1 ),..., 7 r*(re)}, then 


ySD 

^7r(l) 5*"5^7T(n) 


(^Tr(t) J • ■ ■ )^ 7 T(n)) : t 'V 0 7 r(i)v^Vo 7 r(n)^ 7r, ° 7r ( 1 )’ ' ' ‘ ’ -^r'oir(n)) 

= Z/ 4*(i)>-,V(™)( r7r *( 1 )’ ■ • ■ ' T Tr*{n)) 

= ^w„m,...,T n ). 


As for the marginalisation condition (58), it is met for every boundary time by virtue of how we 
extended v SD to missing boundary times. All we need to prove now is that the marginalisation con¬ 
dition is also met at any non-boundary time. To do so, it is sufficient to prove that the marginalisation 
condition holds for t\, that is: 


u SD 

A t K t K 

“"l iVj ’ —>'1 


,a 0 ,...,a K 


.T 


n K 


2 > ■ ■ • j ± N 1 i ■ ■ ■ > ± 1 i - ■ ■ 1 Tn k -)A 0) • • • , Ak) 

( T \,..., , ■ ■ •, Tf ,..., T« k , A 0 ,..., A k ) 


= v t i 
z 2 


SD 


Z N 1 ’ 


t K 

,L, r .,t. 


,ao,...,a K 


(71) 


for every rectangles A, and Tj in M 2 . In effect, cases where some boundary times are missing are 
special cases with the corresponding rectangles Aj set to M 2 . Moreover, if we prove Equation (71), 
the permutation property (57) will allow us to conclude that the marginalisation also holds true for 
any other (single) non-boundary time. Furthermore, if Equation (71) holds true, it can be shown 
that the marginalisation condition will also hold over multiple non-boundary times by using the 
permutation property (57) and marginalising one non-boundary time after another. 
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By Fubini’s theorem, and considering Equation (68), showing that Equation (71) holds true is 
equivalent to showing that: 




*^an ?*£ai 

= PM 




x t i 


) 


(72) 


which holds true as p^° ' x " ] (x t i,, x t i^ ) is a multivariate Gaussian density, and the correspond¬ 
ing marginal is indeed the density of the same conditional derivative Gaussian process at times 
t 1 t 1 

This concludes the proof of the existence of the stochastic process (SD t )t£i- 


Appendix C.2 Proof of Theorem 2 (B) 

As conditional on boundary conditions the restriction of a string derivative Gaussian process on a 
string interval {(ik-\ > a k\ is a derivative Gaussian process, it follows from Proposition 1 (C) that 


^ x ag i • ■ • i x a,K i V t , t + h G [a,k~ i, afc] , 

i 2 \ 

x ag — x ao> • • • i •tta.x — ^O’K I — 0) 


lim E 

h ->0 


Zt+h - Zt , 

- Zt 


h 


(73) 


or equivalently that: 


A z h := E 


zt+h ~ z t 
h 


- z t 


h^t 0 


0 . 


(74) 


Moreover, 

A Zh = Var 


z t+h ~ z t 

h 


- Zt 


+ E 


Zt+h - Zt 

h 


- Zt 


(75) 


As both terms in the sum of the above equation are non-negative, it follows that 


Var 


zt+h ~ z t 

h 


- Zt 


From which we deduce 


h->0 


( Zt+h - Zt , 
- Zt 


0 and E 


Zt+h - Zt 

h 


- Zt 


h-t 0 


0 . 


V h 


x aoi ■ ■ ■ i x a K I ~ ^ 0 - 


AsE 


(zt+h-zt J 
^ h Z + 


x ao , ■ ■ ■, x aK ) depends linearly on the boundary conditions, and as the bound¬ 


ary conditions are jointly-Gaussian (see Appendix H step 1), it follows that 

E 


z t +h ~ z t , 

- Zt 


is Gaussian. Finally we note that 

Var 


h 


z t +h ~ z t , 

- u - z t 


f-ao }■■■■> x ctK 


x ao j ■ ■ ■ i x a,K 
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does not depend on the values of the boundary conditions x ak (but rather on the boundary times), 
and we recall that convergence almost sure of Gaussian random variables implies convergence in 
L 2 . Hence, taking the expectation on both side of Equation (75) and then the limit as h goes to 0 
we get 


E 


z t+ h ~ z t 

h 



E(A z h ) — 0 

h —^0 


which proves that the string GP ( zt)tei is differentiable in the L 2 sense on I and has derivative 

KW- 

We prove the continuity in the L 2 sense of (z' t )t^i in a similar fashion, noting that conditional 


dm 


on the boundary conditions, ( z' t )tei is a Gaussian process whose mean function rk 


l k-l’ a k 


dt 


and 


covariance function 


dxdy 


are continuous, thus is continuous in the L 2 sense on [a k -i,a k ] 


(conditional on the boundary conditions). We therefore have that: 


V x O0 ,..., Xa K , \/t,t + he [a k -i, a k ], lim E ((z' t+h - z' t f 


x, 


a . o 


= X, 


a 0 , , X aK 


= X aK ) = 0, 
(76) 


from which we get that: 

Az h '■= E ( [zt +h ~ 4 ] 

Moreover, 

Az'h = Var (z' t+h - z' t \x ao ,..., x aK ) + E (z' t+h - z[ 

which implies that 


h-> 0 


0 . 


^ao )■■■■> Xaji 




Var (z' t+h - 4|*ao. • • •. x aK ) 0 


and 


/l—T 0 


t~> / / / I ffl.S. 

E Y't+h ~ ^i|*oo. ‘ ' ' i Xa Kj 


(77) 


(78) 


as both terms in the sum in Equation (78) are non-negative. Finally, 

Var(z' t+h -z' t \x ao ,...,x aK ) 

does not depend on the values of the boundary conditions, and 

E (^7+h > *o.sr) 

is Gaussian for the same reason as before. Hence, taking the expectation on both sides of Equation 
(78), we get that 

vi a-' 


(\ z t+h 


- Z + 


= E(A4) —> 0, 

/i.-rO 


which proves that (z' t ) is continuous in the L 2 sense. 
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Appendix D. Proof of the Condition for Pathwise Regularity Upgrade of String GPs 
from L 2 

In this section we prove that a sufficient condition for the process (z'fjtei i n Theorem 2 to be almost 
surely continuous and to be the almost sure derivative of the string Gaussian process ( zf)tei, is 
that the Gaussian processes on = |a/._]. af\ with mean and covariance functions rn*£r 1,ak and 
C - 1 Jlk (as per Equations 3 and 4 with m := mk and k := kf) are themselves almost surely C 1 for 
every boundary condition. 

Firstly we note that the above condition guarantees that the result holds at non-boundary times. 
As for boundary times, the condition implies that the string GP is almost surely right differentiable 
(resp. left differentiable) at every left (resp. right) boundary time, including a o and ax. Moreover, 
the string GP being differentiable in L 2 , the right hand side and left hand side almost sure derivatives 
are the same, and are equal to the L 2 derivative, which proves that the L 2 derivatives at inner 
boundary times are also in the almost sure sense. A similar argument holds to conclude that the 
right (resp. left) hand side derivative at ao (resp. ax) is also in the almost sure sense. Moreover, the 
derivative process (z[)tei admits an almost sure right hand side limit and an almost sure left hand 
side limit at every inner boundary time and both are equal as the derivative is continuous in L 2 , 
which proves its almost sure continuity at inner boundary times. Almost sure continuity of (z'fjt^i 
on the right (resp. left) of a o (resp. ax) is a direct consequence of the above condition. 


Appendix E. Proof of Proposition 4 

In this section, we prove Proposition 4, which we recall below. 

Proposition 4 (Additively separable string GPs are flexible) 

Let k(x,y) := p (||x — ;y| | 2 2 ) be a stationary covariance function generating a.s. C 1 GP paths in¬ 
dexed on d > 0, and p a function that is C 2 on ]0, +oo[ and continuous at 0. Let <j> s (x \,..., Xd) = 
x j ’ l et ( z i)tei'i, je[i..d] ^ independent stationary Gaussian processes with mean 0 and covari¬ 
ance function k (where the L 2 norm is on M), and let td) = ( ps(z}\ ,..., zf ) be the corre¬ 

sponding stationary string GP. Finally, let g be an isotropic Gaussian process indexed on / 1 x • • • x /' , 
with mean 0 and covariance function k (where the L 2 norm is on W 1 ). Then: 

1) V x e I 1 x ■ ■ ■ x I d , H(yf{x )) = H(Xg(x)), 

2) V x + y G I 1 x • • • x I d , 7(V/(x); V/(y)) < 7(V ff (x); Vg(y)). 


To prove Proposition 4 we need a lemma, which we state and prove below. 


Lemma 8 Let X n be a sequence of Gaussian random vectors with auto-covariance matrix S n and 
mean p n , converging almost surely to X^. IfT, n —» and //„ —» p, x then X^ is Gaussian with 

mean and auto-covariance matrix Eoq. 


Proof We need to show that the characteristic function of X^ is 

cj) Xoo (t) := E(e itTx “) = e ^ T Moo-it T s^_ 


As Y, n is positive semi-definite, Vn, |e* fT/in 2 tTSnt | 
dominated convergence theorem, 

(Av, (t) = E( lim e itTXn )= lim E(e itTXn ) = 

n^+oo n—>■+oo 


— e 2 tT ^nt < i Hence, by Lebesgue’s 

lim 2 ^ —— oo 2 ^00 1 

n—>+00 
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Appendix E.l Proof of Proposition 4 1) 

Let x = (tf ,..., t x d ) E I 1 X ■ ■ ■ X I d . We want to show that H(Vf(x)) = H(Vg(x)) where / and 
g are as per Proposition 4, and H is the entropy operator. Firstly, we note from Equation (11) that 


V/ (x) — (zj* ,..., zfi'j 


(79) 


where the joint law of the GP {zf) te jj and its derivative (z{ ) te jj is provided in Proposition 1. As 


the processes \z 3 t: z 3 t J , j G [1 ,.ri] are assumed to be independent of each other, V/(.x') is a 
Gaussian vector and its covariance matrix reads: 


S V/0r) - 


where 1^ is the d x d identity matrix. Hence, 


H(\7f(x)) = ^ (1 + ln(2vr)) + ^ln|E v/(a .)|. 


(80) 


( 81 ) 


Secondly, let e 3 denote the d-dimensional vector whose j-th coordinate is 1 and every other coor¬ 
dinate is 0, and let h £ M. As the proposition assumes the covariance function k generates almost 


surely C 1 surfaces, the vectors 


g(x+hei)-g(x) 
h ’ 


g(x+he d )-g(x ) 
h 


are Gaussian vectors converging 


almost surely as h —> 0. Moreover, their mean is 0 and their covariance matrices have as element 
on the i-th row and j-th column (i / j): 


cov 


g(x + hei) - g(x) g(x + hej) - g{x) \ p{ 2 h 2 ) - 2 p(h 2 ) + p(0) 


h 2 


(82) 


and as diagonal terms: 


Var 


g(x + hej) - g(x)\ ^(O) - p(h 2 ) 


h 


= 2 - 


h 2 


(83) 


Taking the limit of Equations (82) and (83) using the first order Taylor expansion of p (which the 
Proposition assumes is C 2 ), we get that: 


^Vg(x) f( x )’ 

It then follows from Lemma 8 that the limit Vg(x) of 

'g(x + hei) - g(x) g(x + he d ) - g(x) 


(84) 


h h 

is also a Gaussian vector, which proves that H(V f(x)) = iT(V g{x)). 
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Appendix E.2 Proof of Proposition 4 2) 

We start by stating and proving another le mm a we will later use. 

Lemma 9 Let A and B be two d-dimensional jointly Gaussian vectors with diagonal covariance 
matrices Xa and X B respectively. Let X a,b be the cross-covariance matrix between A and B, and 
let diag(TiA,B ) be the diagonal matrix whose diagonal is that of'L / \j>. Then: 


det 


( 

Xa 

diag(T, AtB ) 

Al 


^A,B 

V 

diag(T , AtB ) 

Xb 

yT 

L ^A.B 

X b 


Proof Firstly we note that 


det 


Xa diag(XA,s) 
diag(XA,_B) X B 


det(X A )det (X B - diag(XA,s)X y4 1 diag(XA,s)) 


det 


— det(XA)det (Xb — X^ ^X^Xa,#) • 


and 

Xa Xa,_b 
Xa,b Xb _ 

As the matrix Xa is positive semi-detinite, det(XA) > 0. The case det(XA) = 0 is straight-forward. 
Thus we assume that det(XA) > 0, so that all we need to prove is that 


det (X B - diag(XA,s)XA i diag(XA,s)) > det (Xb - Xa i bX a 1 Xa,b) • 

Secondly, the matrix X^'^ := Xb — diag (X a , b ) X a 1 diag (X 4 iB ) being diagonal, its determinant is 
the product of its diagonal terms: 


det(X' 


diag \ 
B\A> 




.*>* = 


i= 1 


i —1 


nM 


£a,bM] 2 \ 
Xa[l*] / 


As for the matrix Xb|a := Xb — X^ ,B^A Xa,b> we note that it happens to be the covariance matrix 
of the (Gaussian) distribution of B given A, and thus is positive semi-definite and admits a Cholesky 
decomposition X B |a = LL 1 . It follows that 


d d d 

det(X B |A) = n 2 - II ^b\a[l *] = 


i— 1 


1=1 

d 


TT I 'r r- -i 


1=1 


S II ( ) = det(E d “® 


1=1 


T, A [i,i\ 


b\a> 


(85) 


where the first inequality results from the fact that Xb|a[*> *] = Yl}= 1 ^[j- i] 2 by definition of the 
Cholesky decomposition. This proves that 

det (Xb - diag(XA,B)XA 1 diag(XA,B)) > det (X B - X^bX^Xa.b) , 

which as previously discussed concludes the proof of the lemma. ■ 
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Proof of Proposition 4 2): Let x = (tf ,..., t ^), y = (tf,..., £ I 1 x • • • x I d , x / y. 

We want to show that J(V/(x); V/(y)) < /(Vg(x); Vg(y)) where / and g are as per Proposition 
4, and 

/(X; Y) = ff(X) + H(Y) - H{X , Y) 

is the mutual information between X and Y. As we have proved that V x , H (V/(.x)) = H (Vg(x)), 
all we need to prove now is that 


H(Vf(x),Vf(y)) > if(Vg(x),Vg(y)). 

Firstly, it follows from Equation (79) and the fact that the derivative Gaussian processes (zf, z 3 ') t£l j 
are independent that (V/(x), V/(y)) is a jointly Gaussian vector. Moreover, the cross-covariance 
matrix ,v/(y) is diagonal with diagonal terms: 


S v/(aO,v/( y )M - “2 




p 
dx 2 


\x — 


110 


( 86 ) 


Secondly, it follows from a similar argument to the previous proof that (Vg(x), Vg(y)) is also 
a jointly Gaussian vector, and the terms ,v g (y) [h j] are evaluated as limit of the cross- 

' g{x+hej)-g{x) g(y+hej)-g( y)' 
h ’ h 


covariance terms cov 


as h —>• 0. For i = j, 


cov 


g(x + hei) - g(x) g(y + het) - g(y) \ 1 


h 


h 


h? 


2 P 


If 


- p - tV kf + + h ~ ff - p\ - 0) 2 + - h ~ t y if I 0 ( §7 ) 

As p is assumed to be C 2 , the below Taylor expansions around h = 0 hold true: 

p (em - ® 2 ) - p ( Em - 4> 2 + m - h - 4) 2 j = 2(<f -1?)^ (e)m - © 2 ) 


( 88 ) 


dp 

dx 


EM-fl’I+’W-fl’sslEM-r 1 


dx 2 


/i 2 + o(/i 2 ) 


P 




~ P 


y- od 2 +(*f+^ 

k^i 





d 2 - 


-^) 2 )+2(tf-iD 2 J(E 




/i 2 + o(/i 2 ) 
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Plugging Equations (88) and (89) into Equation (87) and taking the limit we obtain: 

42. 


S Vg(a;),Vg(y)[M] - ~ 2 




X -: 


\L 2 ) 


~ S V/(*),V/(i/)M- 

Si mi I ally for i / j, 

'g{x + ha) - g{x) g(y + hej) - g(y) 


cov 


h 


h 


1 


i 2 ) P f (** - t y k) 2 + (if + h - t^f + (tj — h — tyA 

l V / 

p ( E <** - ® 2 + k+a -® 2 ) - p ( E <** - ® 2 + «-'>- *?) 2 


ky^i 




and 


+ p - ‘S 2 )}. 

E M - ® 2 + («? + '•- <?) 2 + (tj -h- f”) 2 ] - P (e<« - ® 2 ) 

J \ k / 

2 af (e« - ® 2 ) + 2 ((tr - (?) - ((? -®) 2 x g (e« - ‘h 
+2 (tr - (? - <j+(?) ^ (e« - ® 2 )'(+»('* 2 )- 


(90) 


(91) 


(92) 


Plugging Equations (88), (89) and (92) in Equation (91) and taking the limit we obtain: 

E Vff(s),V ff (y)M = “ 4 (tf - %){% ~ (H X “ y\\h) ■ (93) 

To summarize, (V/(x), V/(y)) and (\7g(x),V g(y)) are both jointly Gaussian vectors; V/(x), 
Vg(x), V/(y), and Vg{y) are (Gaussian) identically distributed with a diagonal covariance ma¬ 
trix; Ey fM.vf( y ) is diagonal; E Vg(x).Vg(y) has the same diagonal as Ev/(z).v/(y) but has pos¬ 
sibly non-zero off-diagonal terms. Hence, it follows from Lemma 9 that the determinant of the 
auto-covariance matrix of (V/(x), V/(y)) is higher than that of the auto-covariance matrix of 
(Vg(x), V< 7 (y)); or equivalently the entropy of (V/(x), V/(y)) is higher than that of (Vg(x), Vp(y)) 
(as both are Gaussian vectors), which as previously discussed is sufficient to conclude that the mu¬ 
tual information between V/(x) and V/(y) is smaller than that between Vg(x) and Vg(y). 


Appendix F. Proof of Proposition 6 

In this section, we prove Proposition 6, which we recall below. 
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Proposition 6 (Extension of the standard GP paradigm ) 

Let K G N*, let I = [ao, clk\ and I k = [a k _ 1 , a;,-] be intervals with oo < • • • < ax- Furthermore, 
let m : I —> M be a C 1 function, m k the restriction of m to Ik, h : I x I —> R a C 3 symmetric 
positive semi-definite function, and h k the restriction of h to 4 x 4 . If 

(zt)te/ ~ SQV({a k }, {m k }, {h k }), 

then 

VfcG [1..K], (zt) te i k ~gV(m,h). 

Proof 

To prove Proposition 6, we consider the string derivative Gaussian process (Theorem 2) (SDt)tei, 
SD t = ( zt, z' t ) with unconditional string mean and covariance functions as per Proposition 6 and 
prove that its restrictions on the intervals I k = [a k ~ 1 , a k \ are derivative Gaussian processes with the 
same mean function m and covariance function h. Proposition 1(B) will then allow us to conclude 
that ( zt)tei k are GPs with mean m and covariance function h. 

Let t±,... ,t n E\a k -i,a k [ and let p D {x ak _ 1 ) (respectively PD(x ak \x ak _ 1 ) and 
pn{x t] ,... ,xt ri \x ak _ 1 ,x ak )) denote the pdf of the value of the derivative Gaussian process with 
mean function m and covariance function h at a k ~ 1 (respectively its value at a k conditional on its 
value at a k -i, and its values at t \,..., t n conditional on its values at a k -\ and a k ). Saying that the 
restriction of the string derivative Gaussian process (SDt) on [a k _ 1, a k ] is the derivative Gaussian 
process with mean m and covariance h is equivalent to saying that all finite dimensional marginals 
of the string derivative Gaussian process psD(xa k _ 1 ,xt 1 ,... ,xt„,Xa k ), U G [a k _i,a k ], factorise 
as 20 : 

PSD(Xa k _ 1 j Xt 1 , . . . , Xt n , X ak ) — pjo(x ak _ 1 )p£)(Xa k \ Xa k _ 1 )PD(xti ; ■ ■ • i Xt n \ x ak _ 1 , X ak ). 
Moreover, we know from Theorem 2 that by design, psD(x ak _ 1 ,xt lt , xt n ,x ak ) factorises as 

PSD(Xa k _ 1 j Xt x ,■■■■, Xt n , X ak ) PSD(Xa k ~i )PD(Xa k \X ak _ 1 ])PD(xt\ j ■ • • j Xf n \X ak _ 1 , X ak ). 

In other words, all we need to prove is that 

PSD(Xa k ) =p D (Xa k ) 

for every boundary time, which we will do by induction. We note by integrating out every boundary 
condition but the first in pb (as per Theorem 2 (a-1)) that 


PSD(Xa 0 ) =p D (x ao )- 

If we assume that PsD(x ak _ 1 ) = PD(x ak _ 1 ) for some k > 0, then as previously discussed the 
restriction of the string derivative Gaussian process on [a k - \ ■ «■/,•] will be the derivative Gaussian 
process with the same mean and covariance functions, which will imply that psn(x ak ) = pr>(xa k .)- 
This concludes the proof. ■ 


20. We emphasize that the terms on the right hand-side of this equation involve po not psd- 
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Appendix G. Proof of Lemma 10 

In this section, we will prove Lemma 10 that we recall below. 


Lemma 10 Let X be a multivariate Gaussian with mean /ix and covariance matrix Ex- If con¬ 
ditional on X, Y is a multivariate Gaussian with mean MX + A and covariance matrix Ey where 
M, A and Ey do not depend on X, then (X, Y) is a jointly Gaussian vector with mean 


HX-Y 


PX 

Mhx + A\ ’ 


and covariance matrix 

y = [ E X M T 

X]Y [M E x Y c y + MY x M t ' 

Proof To prove this lemma we introduce two vectors X and Y whose lengths are the same as those 
of X and Y respectively, and such that (X, Y) is jointly Gaussian with mean px-,Y and covariance 
matrix Ex ; y- We then prove that the (marginal) distribution of X is the same as the distribution of 
X and that the distribution of Y\X = x is the same as Y\X = x for any x, which is sufficient to 
conclude that (X, Y) and (X, Y) have the same distribution. 

It is obvious from the joint (X, Y) that X is Gaussian distribution with mean jix and covariance 
matrix Ex- As for the distribution of Y conditional on X = x, it follows from the usual Gaussian 
identities that it is Gaussian with mean 


M[ix + c + MY,x'Y x 1 (x — fix) = Mx + c, 


and covariance matrix 


Ey + ME x M t - MExEy 1 Ex M T = Ey, 

which is the same distribution as that of Y|X = x since the covariance matrix Ex is symmetric. 
This concludes our proof. ■ 


Appendix H. Proof of Proposition 5 

In this section we will prove that string GPs with link function 0 S are GPs, or in other words 
that if / is a string GP indexed on M' :/ . d > 0 with link function 0 s (x\..... xj) = Ylj=i X P 
then (f(x \),..., / (x n )) has a multivariate Gaussian distribution for every set of distinct points 

x\,... ,x n G R d . 

Proof As the sum of independent Gaussian processes is a Gaussian process, a sufficient condition 
for additively separable string GPs to be GPs in dimensions d > 1 is that string GPs be GPs in 
dimension 1. Hence, all we need to do is to prove that string GPs are GPs in dimension 1. 

Let , z J t ') teI j be a string derivative GP in dimension 1, with boundary times a ( j,..., a? Kj , and 
unconditional string mean and covariance functions m 3 k and kl respectively. We want to prove that 
(z J tl ,..., zj n ) is jointly Gaussian for any 1 1 ,..., t n 6 I 3 . 
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Appendix H.0.1 Step 1 ( zL , zL ,..., z 3 a ,, zd ,) is jointly Gaussian 

We first prove recursively that the vector , zi' 0 ,..., zi K -, Za Kj ^J is jointly Gaussian. We note 
from Theorem 2 that (zf, zj') te [ a0jai ] is the derivative Gaussian process with mean m\ and co- 
variance function k\. Hence, (z 3 ao , z 3 d 0 , z 3 ai , z^) is jointly Gaussian. Moreover, let us assume that 
Bk -1 := (zl 0 , z J a 0 ,..., zi k _ 15 ^al_i) is jointly Gaussian for some k > 1. Conditional on B^-i, 
(, z 3 a k , z 3 a k ) is Gaussian with covariance matrix independent of Bk- 1 , and with mean 


L dt \ a k) J 

X Jv J if— 1 

K a k’ a k-l K a k _ 1 ;a J k 

Z ij ~ m k( a k-1) 

a k~ 1 

J' (J ) 

z j dt y a k-1) 



k-1 


which depends linearly on ( z& 0 , z 3 d 0 ,..., zi k _ 1 , z J d k _ 1 ). Hence by Lemma 10, 

( 2 ? ^ 

V^ao ’ ao ’ * * * 5 ^a/c ? ) 

is jointly Gaussian. 

Appendix H.0.2 Step 2 (zl 0 , 4' 0 ,..., zi Ki , z 3 d Kj ,..., z 3 k , z 3 !,...) is jointly Gaussian 

Let t\ : ..., t k k o]a],_!. a],[, /r < /v' y be distinct string times. We want to prove that the vector 
(zL,z 3 ao ,, zL., z 3 a ■,..., zP k , z 3 k ,...) where all boundary times are represented, and for any 

i i 

finite number of string times is jointly Gaussian. Firstly, we have already proved that 

{z 3 a 0 ,z 3 a o, ■ • •, z 3 a KJ , z 3 d Kj ) is jointly Gaussian. Secondly, we note from Theorem 2 that conditional 

on ( z 3 ao , z 3 a 0 ,.. •, z 3 a ■, z 3 ! ,.),(..., z 3 k , z 3 k ,...) is a Gaussian vector whose covariance matrix does 

i £ 

not depend on ( z 3 ao ,z 3 a ' 0 , ■ ■ ■, z J a Kj , z 3 d K] ), and whose mean depends linearly on 

(j ~j> J ~i> \ 

y~Q0 > ^a 0 > • • ' J *a K j > ^a K;; y ■ 

Hence, 

(V J' z 3 J’ J J> \ 

y^ao ? -^ao ’ * * • » a xj ’ a KJ 5 * * * ’ ’ * * * y 

is jointly Gaussian (by Lemma 10). 

Appendix H.0.3 Step 3 {z 3 tl ,.... z{ n ) is jointly Gaussian 

(z 3 t] ,z 3 t ' 1 ,..., z { n , ) is jointly Gaussian as it can be regarded as the marginal of some joint distri¬ 
bution of the form (4 0 ,z 3 a ' 0 ,..., ,..., z 3 k , z 3 d,...). Hence, its marginal (z 3 tl ,..., z{ n ) is 

also jointly Gaussian, which concludes our proof. ■ 


Appendix I. Derivation of Global String GP Mean and Covariance Functions 

We begin with derivative string GPs indexed on M. Extensions to membrane GPs are easily achieved 
for a broad range of link functions. In our exposition, we focus on the class of elementary symmetric 
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polynomials (Macdonald (1995)). In addition to containing the link function 4> s previously intro¬ 
duced, this family of polynomials yields global covariance structures that have many similarities 
with existing kernel approaches, which we discuss in Section 4.3. 

For n < d, the n-th order elementary symmetric polynomial is given by 

n 

e 0 (xi,... ,x d ) := 1, VI < n < d e n (xi,... ,x d ) = ^ (94) 

l<jl<j2<—<jn<d k= 1 


As an illustration, 


d 

ei(xi, ...,x d ) = '^2 x j = c/) s (x i,.. .,x d ), 

3 = 1 


e 2 (xi, ...,x d ) = x\x 2 + xix 3 H-h xix d H-b x d -ix d , 


d 

e d (xi,...,x d ) = \\xj =<t> p (x i,...,x d ). 

3 = 1 

Let / denote a membrane GP indexed on M. d with link function e n and by (zj),..., (zf) its inde¬ 
pendent building block string GPs. Furthermore, let m J k and />/ denote the unconditional mean and 
covariance functions corresponding to the £;-th string of (zj ) defined on [a{._ 1 ,a^]. Finally, let us 
define 

m J (t) := E (z J t ), rh?'(t) := E 

the global mean functions of the j-th building block string GP and of its derivative, where Vt G IP 
It follows from the independence of the building block string GPs (zj ) that: 

rh f (t t,.. .,t d ) := E(/(fi,... ,t d )) = e^fh 1 (G),.. .,fh d (t d )). 

Moreover, noting that 


it follows that: 


9e n . . 

— £n—l\%li • • • i %j—h 'Ej+ 1? • • • j %d)') 


m Vf (t i, ...,t d ) :=E(V/(ti,.. .,t d )) 


m v {ti)e n -1 (m 2 (t 2 )i... ,m d (t d )) 
m d, (t d )e n -i (■ m 1 (t 1 ),.. ..m^ 1 ^-!)) 


Furthermore, for any Uj. Vj G I 3 we also have that 


cov (/(til, • • ■, U d ), f{v 1 ,..., v d )) = e n 


( cov (4i>4i)>---> cov (^ 


d 

u d i 
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and for i < j 



e n (cov (4,44 ... ,cov(z*,zi.), ... ,cov(4, z J Vj ), ..., cov(4 d , 4j) > 
e n (cov(4 lS 4j,... ,cov(4'., 4'),...,cov(4 d , 4 d )) , 


if i<j 

if i = j 


Overall, for any elementary symmetric polynomial link function, multivariate mean and covariance 
functions are easily deduced from the previously boxed equations and the univariate quantities 


m? (u), and J K u;u := 


cov(4,4) cov(4,4') 
cov(4',4) cov(4',4') 


= j K T 

V\U1 


which we now derive. In this regards, we will need the following lemma. 

Lemma 10 Let X be a multivariate Gaussian with mean fix an d covariance matrix Ex- If condi¬ 
tional on X, Y is a multivariate Gaussian with mean MX + A and covariance matrix Ey where 
M, A and Ey do not depend on X, then (X, 1") is a jointly Gaussian vector with mean 


dx-Y 


dx 

Mp x + A\ ' 


and covariance matrix 


Proof See Appendix G. 


' Ex E X M T 

MEx Ey + ME x M t 


Appendix 1.0.4 Global String GP Mean Functions 

We now turn to evaluating the univariate global mean functions fhP and w?'. We start with boundary 
times and then generalise to other times. 

Boundary times: We note from Theorem 2 that the restriction ( zj, zl' ) is the derivative 

V /te[aQ,aj] 

Gaussian process with mean and covariance functions m\ and k:\. Thus, 



1 1 

■r-,0'^0 

i_i 

= 

mi(4) 

fxLtJ) 

L dt l a olJ 

, and 

1 1 

1_1 

= 

.¥(4). 



For k > 1, we recall that conditional on 




J J> 

t , * x 

ai a J , 


is Gaussian with mean 


m kH) 

dm J y 

dt 




+ {k 


3 3 

k’ k-1 


j k - 1 

fcWA , n 3 

a k-l’ a k-l 


Z J ~ m fe(°fe-1) 


J' 


dm? k 
~dt~ 1 


"k-V 
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String times: As for non-boundary times t e]a^_ l5 a J k [, conditional on yzi k _ 1 , Za k _ 1 j and \ z 3 ak , z}[ k j, 
(z{, zj' \ is Gaussian with mean 



Appendix 1.0.5 Global String GP Covariance Functions 

As for the evaluation of J, K U .„, we start by noting that the covariance function of a univariate string 
GP is the same as that of another string GP whose strings have the same unconditional kernels but 
unconditional mean functions mi = 0, so that to evaluate univariate string GP kernels we may 
assume that V j, k, mi = 0 without loss of generality. We start with the case where u and v arc 
both boundary times, after which we will generalise to other times. 
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Boundary times: As previously discussed, the restriction ( zJ t , zf 


sian 


process with mean 0 and covariance function k\. Thus, 


t&l a3 0 ,a{\ 


is the derivative Gaus- 


Jk j K k K ■ -'K K ■ 

cl q ',clq 1 a \\ a \ 1 a i .; a i 7 J - 


(95) 


We recall that conditional on the boundary conditions at or prior to a? k v ( z J j , z J ' :j J is Gaussian 
with mean 


.3' 

k a l 


IM 


a i -1 

J' 


with IM = IK 


jV-l 

°-k'K-l k O’k—i'O'k—i 1 


6 s = !k 


— IM , K 




and covariance matrix 

I.Zj — j IV j J L.J.VJ. j IV j 

k k a k’ a k k k a k-V° 

Hence using Lemma 10 with M = [|M 0 ... 0] where there are (k — 1) null block 2x2 

is jointly Gaussian, it follows that the vector 


matrices, and noting that ( zJ :j , z 3 ' :j , 


0 5 

'0 “0 


y 3 y 3 r 

? ^ ?' 1 ^ 
a J , a, 

Ac — 1 k — 1 


yj' 

^„3 > ' ■ • 5 


is jointly Gaussian, that I zL , z 3 ' :j I has covariance matrix 


( k i i ■ Im-'k, , aj \m t , 


and that the covariance matrix between the boundary conditions at a{ and at any earlier boundary 
time aj, l < k reads: 


j K , K 


<-’ a i k 


3 3 • 

fc-l’ a Z 


String times: Let u G [a? aj,], v G [a? q _ ,, a J q \ . By the law of total expectation, we have that 


J K — F 


Z 3 u 

Zu 


3 3 f 
Zy Zy 


= E E 


zi 

Zu 


J Jj! 

Ay Ay 


B{p,q) 


where B(p, q ) refers to the boundary condtions at the boundaries of the p-th and q-th strings, in other 
words < z 3 x , Zx , x G {a J { . aj,, 1 , a)} >. Furthermore, using the definition of the covariance 


matrix under the conditional law, it follows that 


zi 

Zu 


3 3' 

Zy Zy 


B(p,q ) = s c K u . v +E 


zi 

Zu 


B(p,q) )e( zi zi' 


B(p,q)), (96) 


where 3 C K U - V refers to the covariance matrix between (zi, zi!) and (zi, zi') conditional on the bound¬ 
ary conditions B(p, q), and can be easily evaluated from Theorem 2. In particular, 


if P + q, i^uyv = 0, and if p = q, 3 c K mv = 3 K U . V - 3 A h 


j frT 

j 

V’ a p-1 

3 \rT 

pIV j 

y v;a J p j 


(97) 
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where 


V xj, i A x — 


Jtt Jv 

1 0 1 ^ 3 

i x;a, i t x:ai 


3\r . 3\r . 

rV —7 . _7 7 JV 7 ? 




K 


We also note that 


K 


/ -n-7 / -n 3 

i a l ,a l _ 1 i ,a t 


-1 




#(p,g) = p A 


a P -i 

jf 

Z j 

a p -1 


Qp 

J' 


and E[ 


J j/ 


B(P; 3) = 


J J' J J> 

^ a A ^ il n 

L Og_l %_i «9 <4j 




Hence, taking the expectation with respect to the boundary conditions on both sides of Equation 
(96), we obtain: 


V U £ ^ 1’ “1“ 

P K ,■ 

J K , , 

1 

•t^O> 

C3 

1 '«■ 
ft. „ 

. ., e 'tad 
'tad -ET 

>A t 


a p, a q _ 1 

O'p't^Lq 



where J C K U]V is provided in Equation (97). 
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